ABSTRACT

We consider unordered XML, where the relative order among 
siblings is ignored, and propose two simple yet practical 
schema formalisms: disjunctive multiplicity schemas (DMS),
and its restriction, disjunction-free multiplicity schemas (MS). 
We investigate their computational properties and character- 
ize the complexity of the following static analysis problems: 
schema satisfiability, membership of a tree to the language of 
a schema, schema containment, twig query satisfiability, im- 
plication, and containment in the presence of schema. Our 
research indicates that the proposed formalisms retain much 
of the expressiveness of DTDs without an increase in com- 
putational complexity. 

1. INTRODUCTION 

When XML is used for document-centric applications, the
relative order among the elements is typically important e.g., 
the relative order of paragraphs and chapters in a book. On 
the other hand, in case of data-centric XML applications, 
the order among the elements may be unimportant [1]. In
this paper we focus on the latter use case. As an example, 
take a trivialized fragment of an XML document containing 
the DBLP repository in Figure 1. While the order of the
elements title, author, and year may differ from one publi- 
cation to another, it has no impact on the semantics of the 
data stored in this semi-structured database. 

A schema for XML is a description of the type of admissi- 
ble documents, typically defining for every node its content 
model i.e., the children nodes it must, may or cannot contain. 
For instance, in the DBLP example, we shall require every 
article to have exactly one title, one year, and one or more 
author's. A book may additionally contain one publisher 
and may also have one or more editor's instead of author's. 
A schema has numerous important uses. For instance, it 
allows to validate a document against a schema and iden- 
tify potential errors. A schema also serves as a reference for 
any user who does not know yet the structure of the XML 
document and attempts to query or modify its contents. 
Figure 1: A trivialized DBLP repository.

The Document Type Definition (DTD), the most widespread 
XML schema formalism for (ordered) XML O [11], is es- 
sentially a set of rules associating with each label a regu- 
lar expression that defines the admissible sequences of chil- 
dren. The DTDs are best fitted towards ordered content 
because they use regular expressions, a formalism that de- 
fines sequences of labels. However, when unordered content 
model needs to be defined, there is a tendency to use over- 
permissive regular expressions. For instance, the DTD be- 
low corresponds to the one used in practice for the DBLP 
repository: 



dblp    → (article | book)*
article → (title | year | author)*
book    → (title | year | author | editor | publisher)*



This DTD allows an article to contain any number of ti- 
tle, year, and author elements. A book may also have any 
number of title, year, author, editor, and publisher elements. 
These regular expressions are clearly over-permissive because 
they allow documents that do not follow the intuitive guide- 
lines set out earlier e.g., a document containing an article 
with two title's and no author should not be admissible.

While it is possible to capture unordered content models 
with regular expressions, a simple pumping argument shows 
that their size may need to be exponential in the number of 
possible labels of the children. In case of the DBLP repos- 
itory, this number reaches values up to 12, which basically 
precludes any practical use of such regular expressions. This 
suggests that over-permissive regular expressions may be em- 
ployed for the reasons of conciseness and readability. 

The use of over-permissive regular expressions, apart from 
allowing documents that do not follow the guidelines, has 
other negative consequences e.g., in static analysis tasks that 
involve the schema. Take for example the following two twig
queries [2, 21]:

/dblp/book[author = "C. Papadimitriou"]
/dblp/book[author = "C. Papadimitriou"][title]

The first query selects the elements labeled book, children
of dblp and having an author containing the text "C. Papadimitriou."
The second query additionally requires that
book has a title. Naturally, these two queries should be
equivalent because every book element should have a title
child. However, the DTD above does not capture properly
this requirement, and, consequently, the two queries are not
equivalent w.r.t. this DTD.

Problem of interest    | DTD                     | DMS                   | disjunction-free DTD  | MS
-----------------------+-------------------------+-----------------------+-----------------------+------------------
Schema satisfiability  | PTIME [8, 19]           | PTIME (Prop. 4.1)     | PTIME [8, 19]         | PTIME (Prop. 4.1)
Membership             | PTIME [8, 19]           | PTIME (Prop. 4.1)     | PTIME [8, 19]         | PTIME (Prop. 4.1)
Schema containment     | PSPACE-c†/PTIME [8, 19] | PTIME (Th. 4.3)       | coNP-h†/PTIME [8, 14] | PTIME (Th. 4.3)
Query satisfiability‡  | NP-c [4]                | NP-c (Prop. 4.4)      | PTIME [4]             | PTIME (Th. 4.8)
Query implication‡     | EXPTIME-c [17]          | EXPTIME-c (Prop. 4.5) | PTIME (Cor. 4.10)     | PTIME (Th. 4.8)
Query containment‡     | EXPTIME-c [17]          | EXPTIME-c (Prop. 4.5) | coNP-c (Cor. 4.10)    | coNP-c (Th. 4.9)

† when non-deterministic regular expressions are used, ‡ for twig queries.

Table 1: Summary of complexity results.

In this paper, we study two new schema formalisms: the 
disjunctive multiplicity schema (DMS) and its restriction, 
the disjunction-free multiplicity schema (MS). While they
use a user-friendly syntax inspired by DTDs, they define
unordered content models only, and, therefore, they are better
suited for unordered XML. A DMS is a set of rules associating
with each label the possible number of occurrences for all
the allowed children labels by using multiplicities: "*" (0 or
more occurrences), "+" (1 or more), "?" (0 or 1), "1" (exactly
1 occurrence; often omitted for brevity). Additionally, alternatives
can be specified using restricted disjunction ("|") and
all the conditions are gathered with unordered concatenation
("||"). For instance, the following DMS captures precisely the
intuitive requirements for the DBLP repository:

dblp    → article* || book*
article → title || year || author+
book    → title || year || publisher? || (author+ | editor+)

In particular, an article must have exactly one title, exactly 
one year, and at least one author. A book may additionally 
have a publisher and may have one or more editor's instead 
of author's. Note that, unlike the DTD defined earlier, this 
DMS does not allow documents having an article with sev- 
eral title's or without any author. 
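
To make the rule semantics concrete, the following is a minimal sketch of how the children of a single node could be checked against one DMS rule. The encoding (a rule as a list of `(atoms, multiplicity)` pairs) and the names `MULT`, `parts_interval`, `matches`, `ARTICLE`, and `BOOK` are assumptions made for this illustration and do not come from the paper. The sketch relies on the fact that every multiplicity denotes an interval of natural numbers, so membership reduces to interval arithmetic.

```python
from collections import Counter
from math import inf

# Each multiplicity denotes an interval [low, high] of natural numbers.
MULT = {"*": (0, inf), "+": (1, inf), "?": (0, 1), "1": (1, 1), "0": (0, 0)}

def parts_interval(n, mult):
    """Interval of p such that n is a sum of p numbers taken from [[mult]]; None if impossible."""
    lo, hi = MULT[mult]
    if hi == 0:                                    # multiplicity 0: the symbol must be absent
        return (0, inf) if n == 0 else None
    min_p = 0 if n == 0 else (1 if hi == inf else n)   # here hi is 1 or infinity
    max_p = inf if lo == 0 else n                      # here lo is 0 or 1
    return (min_p, max_p)

def matches_disjunction(w, atoms, mult):
    """Check that the occurrences in w of the symbols of D = (a1^M1 | ... | ak^Mk) fit D^mult."""
    total_lo, total_hi = 0, 0
    for symbol, m in atoms:
        interval = parts_interval(w[symbol], m)
        if interval is None:
            return False
        total_lo += interval[0]
        total_hi += interval[1]
    lo, hi = MULT[mult]
    return total_lo <= hi and lo <= total_hi       # the two intervals intersect

def matches(w, expr):
    """Check that the unordered word w (a Counter) belongs to the language of expr."""
    symbols = {s for atoms, _ in expr for s, _ in atoms}
    if any(count > 0 and s not in symbols for s, count in w.items()):
        return False                               # a label that the rule does not mention
    return all(matches_disjunction(w, atoms, m) for atoms, m in expr)

# The DBLP rules for article and book from the example above.
ARTICLE = [([("title", "1")], "1"), ([("year", "1")], "1"), ([("author", "1")], "+")]
BOOK = [([("title", "1")], "1"), ([("year", "1")], "1"),
        ([("publisher", "1")], "?"), ([("author", "+"), ("editor", "+")], "1")]

print(matches(Counter(title=1, year=1, author=2), ARTICLE))
print(matches(Counter(title=2, year=1), ARTICLE))
```

Under this encoding, the first call accepts an article with one title, one year, and two authors, while the second rejects an article with two titles and no author.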

There has been an attempt to use DTD-like rule based schemas 
to define unordered content models by interpreting the reg- 
ular expressions under commutative closure [3]: essentially, 
an unordered collection of children matches a regular ex- 
pression if there exists an ordering that matches the regular 
expression in the standard way. However, testing whether
there exists a permutation of a word that matches a regular
expression is NP-complete [13], which implies a significant
increase in computational complexity of the membership
problem i.e., validating an XML document against
the schema. The schema formalisms proposed in this paper,
DMS and MS, can be seen as DTDs interpreted under
commutative closure using restricted classes of regular
expressions. Two natural questions arise: do these restrictions
allow us to avoid the increase in computational complexity,
and how much of the expressiveness of DTDs is retained.
The answers are generally positive. There is no increase in
computational complexity but also no decrease (cf. Table 1).
Furthermore, the proposed schema formalisms seem to capture
a significant part of the expressiveness of DTDs used in
practice (Section 5).

We study the complexity of several basic decision problems: 
schema satisfiability, membership of a tree to the language 
of a schema, containment of two schemas, twig query sat- 
isfiability, implication, and containment in the presence of 
schema. Table 1 contains the summary of complexity results
compared with general DTDs and disjunction-free DTDs. 
The lower bounds for the decision problems for DMS and 
MS are generally obtained with easy adaptations of their 
counterparts for general DTDs and disjunction-free DTDs. 
To obtain upper bounds we develop several new tools. De- 
pendency graphs for MS and a generalized definition of an 
embedding of a query help us to reason about query sat- 
isfiability, query implication, and query containment in the 
presence of MS. An alternative characterization of DMS with 
characteristic sets is used to reduce the containment of DMS 
to the containment of their characteristic sets, which can be 
tested in PTIME. We add that our constructions and results 
for MS extend easily to disjunction-free DTDs and allow to 
solve the problems of query implication and query contain- 
ment, which, to the best of our knowledge, have not been 
previously studied for disjunction-free DTDs.

Because of space restrictions, the proofs of all claims are omitted;
they can be found in the appendix of the full version of
the paper available at http://arxiv.org/abs/1303.4277.

Related work. Languages of unordered trees can be ex- 
pressed by logic formalisms or by tree automata. Boneva 
et al. [6, 7] survey such formalisms and compare
their expressiveness. The fundamental difference resides in 
the kind of constraints that can be expressed for the allowed 
collections of children for some node. We mention here only 
formalisms introduced in the context of XML. Presburger 
automata [5D], sheaves automata [T^, and the TQL logic [S] 
allow to express Presburger constraints on the numbers of 
occurrences of the different symbols among the children of 
some node. This is also equivalent to considering DTDs un- 
der commutative closure, similarly to [3]. The consequence 
of the high expressive power is that the membership problem 
is NP-complete for an unbounded alphabet [13]. Therefore,
these formalisms were not extensively used in practice. Suit- 
able restrictions on Presburger automata and on the TQL 



logic allow to obtain the same expressiveness as the MSO 
logic on unordered trees [6, 7]. DMS are strictly less expressive
than these MSO-equivalent languages. Static analysis
problems involving twig queries were not studied for these 
languages. Additionally, we believe that DMS are more ap- 
propriate to be used as schema languages, as they were de- 
signed as such, in particular regarding the more user-friendly 
DTD-like syntax. As mentioned earlier, unordered content 
model can also be defined by DTDs defining commutatively- 
closed sets of ordered trees. An (ordered) tree matches such 
a DTD iff every tree obtained by reordering its sibling nodes
also matches the DTD. This also turns out to be equally
expressive as MSO on unordered trees [6, 7]. However, such a
DTD may be of exponential size w.r.t. the size of the alphabet
and, moreover, it is PSPACE-complete to test whether a
DTD defines a commutatively-closed set of trees [16], which
makes such DTDs unusable in practice. Some of the other 
schema languages also propose some support for unordered 
content model. XML Schema and RELAX NG allow to ig- 
nore the order of a bounded number of sibling nodes. In 
contrast, the Kleene star in DMS allows for unbounded un- 
ordered collections of children. Finally, Schematron allows 
to specify very general constraints on the number of occur- 
rences of symbols among the children of a node, in particular 
Presburger constraints are expressible. To the best of our 
knowledge, the static analysis problems we are interested in 
were not studied for these languages in the setting where 
unordered content model is allowed. 

2. PRELIMINARIES 

Throughout this paper we assume an alphabet $\Sigma$ which is a
finite set of symbols.

Trees. We model XML documents with unordered labeled
trees. Formally, a tree $t$ is a tuple $(N_t, root_t, lab_t, child_t)$,
where $N_t$ is a finite set of nodes, $root_t \in N_t$ is a distinguished
root node, $lab_t : N_t \to \Sigma$ is a labeling function, and
$child_t \subseteq N_t \times N_t$ is the parent-child relation. We assume that the
relation $child_t$ is acyclic and require every non-root node to
have exactly one predecessor in this relation. By $Tree$ we
denote the set of all finite trees.

(a) Tree $t_0$. (b) Twig query $q_0$.
Figure 2: A tree and a twig query.

Queries. We work with the class of twig queries, which are
essentially unordered trees whose nodes may be additionally
labeled with a distinguished wildcard symbol $\star \notin \Sigma$ and
that use two types of edges, child (/) and descendant (//),
corresponding to the standard XPath axes. Note that the
semantics of the //-edge is that of a proper descendant (and not
that of descendant-or-self). Formally, a twig query $q$ is a
tuple $(N_q, root_q, lab_q, child_q, desc_q)$, where $N_q$ is a finite set
of nodes, $root_q \in N_q$ is the root node,
$lab_q : N_q \to \Sigma \cup \{\star\}$ is a labeling function,
$child_q \subseteq N_q \times N_q$ is a set of child edges, and
$desc_q \subseteq N_q \times N_q$ is a set of descendant edges.
We assume that $child_q \cap desc_q = \emptyset$ and that the relation
$child_q \cup desc_q$ is acyclic and we require every non-root node
to have exactly one predecessor in this relation. By $Twig$
we denote the set of all twig queries. Twig queries are often
presented using the abbreviated XPath syntax [21] e.g., the
query $q_0$ in Figure 2(b) can be written as r/*[*]//a.

Embeddings. We define the semantics of twig queries using
the notion of embedding which is essentially a mapping of
nodes of a query to the nodes of a tree that respects the
semantics of the edges of the query. Formally, for a query
$q \in Twig$ and a tree $t \in Tree$, an embedding of $q$ in $t$ is a
function $\lambda : N_q \to N_t$ such that:

1. $\lambda(root_q) = root_t$,

2. for every $(n, n') \in child_q$, $(\lambda(n), \lambda(n')) \in child_t$,

3. for every $(n, n') \in desc_q$, $(\lambda(n), \lambda(n')) \in (child_t)^+$ (the
transitive closure of $child_t$),

4. for every $n \in N_q$, $lab_q(n) = \star$ or $lab_q(n) = lab_t(\lambda(n))$.

If there exists an embedding from $q$ to $t$ we say that $t$ satisfies
$q$ and we write $t \models q$. By $L(q)$ we denote the set of
all the trees satisfying $q$. Note that we do not require the
embedding to be injective i.e., two nodes of the query may
be mapped to the same node of the tree. Figure 3 presents
all embeddings of the query $q_0$ in the tree $t_0$ from Figure 2.

Figure 3: Embeddings of $q_0$ in $t_0$.



Unordered words. An unordered word is essentially a
multiset of symbols i.e., a function $w : \Sigma \to \mathbb{N}_0$ mapping
symbols from the alphabet to natural numbers, and we call
the number $w(a)$ the number of occurrences of the symbol $a$
in $w$. We also write $a \in w$ as a shorthand for $w(a) \neq 0$. An
empty word $\varepsilon$ is an unordered word that has 0 occurrences
of every symbol i.e., $\varepsilon(a) = 0$ for every $a \in \Sigma$. We often use
a simple representation of unordered words, writing each
symbol in the alphabet the number of times it occurs in
the unordered word. For example, when the alphabet is
$\Sigma = \{a, b, c\}$, $w_0 = aaacc$ stands for the function $w_0(a) = 3$,
$w_0(b) = 0$, and $w_0(c) = 2$.

The (unordered) concatenation of two unordered words $w_1$
and $w_2$ is defined as the multiset union $w_1 \uplus w_2$ i.e., the
function defined as $(w_1 \uplus w_2)(a) = w_1(a) + w_2(a)$ for all
$a \in \Sigma$. For instance, $aaacc \uplus abbc = aaaabbccc$. Note that $\varepsilon$
is the identity element of the unordered concatenation:
$\varepsilon \uplus w = w \uplus \varepsilon = w$ for every unordered word $w$. Also, given
an unordered word $w$, by $w^i$ we denote the concatenation
$w \uplus \ldots \uplus w$ ($i$ times).
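
As a small illustration of these definitions, unordered words can be modeled with Python's `collections.Counter`, whose `+` operator adds occurrence counts symbol by symbol. This is only a sketch; the helper names `word`, `concat`, and `power` are chosen for this example and are not part of the paper.

```python
from collections import Counter

def word(s: str) -> Counter:
    """Build an unordered word (a multiset of symbols) from a string such as 'aaacc'."""
    return Counter(s)

def concat(w1: Counter, w2: Counter) -> Counter:
    """Unordered concatenation: add the occurrence counts symbol by symbol."""
    return w1 + w2

def power(w: Counter, i: int) -> Counter:
    """The concatenation of i copies of w (the empty word when i == 0)."""
    result = Counter()
    for _ in range(i):
        result = result + w
    return result

w0 = word("aaacc")
print(w0["a"], w0["b"], w0["c"])                      # occurrences of a, b, c in w0
print(concat(word("aaacc"), word("abbc")) == word("aaaabbccc"))
print(concat(Counter(), w0) == w0)                    # the empty word is the identity
```

The order in which the symbols of a word are written is irrelevant in this representation, which mirrors the multiset reading of unordered words.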



A language is a set of unordered words. The unordered
concatenation of two languages $L_1$ and $L_2$ is a language
$L_1 \uplus L_2 = \{w_1 \uplus w_2 \mid w_1 \in L_1, w_2 \in L_2\}$. For instance, if
$L_1 = \{a, aac\}$ and $L_2 = \{ac, b, \varepsilon\}$, then
$L_1 \uplus L_2 = \{a, ab, aac, aabc, aaacc\}$.

3. MULTIPLICITY SCHEMAS

A multiplicity is an element from the set $\{*, +, ?, 0, 1\}$. We
define the function $[\![\cdot]\!]$ mapping multiplicities to sets of natural
numbers. More precisely:

$[\![*]\!] = \{i \in \mathbb{N} \mid i \geq 0\}$, $[\![+]\!] = \{i \in \mathbb{N} \mid i \geq 1\}$, $[\![?]\!] = \{0, 1\}$,
$[\![1]\!] = \{1\}$, $[\![0]\!] = \{0\}$.

Given a symbol $a \in \Sigma$ and a multiplicity $M$, the language
of $a^M$, denoted $L(a^M)$, is $\{a^i \mid i \in [\![M]\!]\}$. For example,
$L(a^+) = \{a, aa, \ldots\}$, $L(b^0) = \{\varepsilon\}$, and $L(c^?) = \{\varepsilon, c\}$.

A disjunctive multiplicity expression $E$ is:

$E := D_1^{M_1} \| \ldots \| D_n^{M_n}$,

where for all $1 \leq i \leq n$, $M_i$ is a multiplicity and each $D_i$ is:

$D_i := a_1^{M'_1} \mid \ldots \mid a_k^{M'_k}$,

where for all $1 \leq j \leq k$, $M'_j$ is a multiplicity and $a_j \in \Sigma$.
Moreover, we require that every symbol $a \in \Sigma$ is present
at most once in a disjunctive multiplicity expression. For
instance, $(a \mid b) \| (c \mid d)$ is a disjunctive multiplicity expression,
but $(a \mid b) \| c \| (a \mid d)$ is not because $a$ appears twice.
A disjunction-free multiplicity expression is an expression
which uses no disjunction symbol "|" i.e., an expression of
the form $a_1^{M_1} \| \ldots \| a_k^{M_k}$, where for all $1 \leq i \leq k$, the $a_i$
are pairwise distinct symbols in the alphabet and the $M_i$
are multiplicities.

The language of a disjunctive multiplicity expression is:

$L(a_1^{M_1} \mid \ldots \mid a_k^{M_k}) = L(a_1^{M_1}) \cup \ldots \cup L(a_k^{M_k})$,
$L(D^M) = \{w_1 \uplus \ldots \uplus w_i \mid w_1, \ldots, w_i \in L(D) \wedge i \in [\![M]\!]\}$,
$L(D_1^{M_1} \| \ldots \| D_n^{M_n}) = L(D_1^{M_1}) \uplus \ldots \uplus L(D_n^{M_n})$.

When a symbol $a$ (resp. a disjunctive multiplicity expression
$E$) has multiplicity 1, we often write $a$ (resp. $E$) instead
of $a^1$ (resp. $E^1$). Moreover, we omit writing symbols and
disjunctive multiplicity expressions with multiplicity 0. Take
for instance, $E_0 = a^+ \| (b \mid c) \| d^?$ and note that both the
symbols $b$ and $c$ as well as the disjunction $(b \mid c)$ have an
implicit multiplicity 1. The language of $E_0$ is:

$L(E_0) = \{a^i b^j c^k d^\ell \mid i, j, k, \ell \in \mathbb{N},\ i \geq 1,\ j + k = 1,\ \ell \leq 1\}$.

Next, we formally define the proposed schema formalisms.

Definition 3.1 A disjunctive multiplicity schema (DMS) is
a tuple $S = (root_S, R_S)$, where $root_S \in \Sigma$ is a designated root
label and $R_S$ maps symbols in $\Sigma$ to disjunctive multiplicity
expressions. By $DMS$ we denote the set of all disjunctive
multiplicity schemas. A disjunction-free multiplicity schema
(MS) $S = (root_S, R_S)$ is a restriction of the DMS, where
$R_S$ maps symbols in $\Sigma$ to disjunction-free multiplicity
expressions. By $MS$ we denote the set of all disjunction-free
multiplicity schemas.

To define satisfiability of a DMS (or MS) $S$ by a tree $t$ we first
define the unordered word $ch_t^n$ of children of a node $n \in N_t$
of $t$ i.e., $ch_t^n(a) = |\{m \in N_t \mid (n, m) \in child_t \wedge lab_t(m) = a\}|$.
Now, a tree $t$ satisfies $S$, in symbols $t \models S$, if $lab_t(root_t) = root_S$
and for any node $n \in N_t$, $ch_t^n \in L(R_S(lab_t(n)))$. By
$L(S) \subseteq Tree$ we denote the set of all the trees satisfying $S$.

In the sequel, we represent a schema $S = (root_S, R_S)$ as
a set of rules of the form $a \to R_S(a)$, for any $a \in \Sigma$. If
$L(R_S(a)) = \{\varepsilon\}$, then we write $a \to \varepsilon$ or we simply omit
writing such a rule.

Example 3.2 We present schemas $S_1$, $S_2$, $S_3$, $S_4$, illustrating
the formalisms defined above. They have the root label $r$
and the rules:



Si 
S2 

54 



a II 6=^ II 
c II 6 II a 



{a I b)' II c 
(a I & I c)* 



$S_1$ and $S_2$ are MS, while $S_3$ and $S_4$ are DMS. The tree $t_0$
from Figure 2(a) satisfies only $S_1$ and $S_3$.

4. STATIC ANALYSIS

We first define the problems of interest and we formally state 
the corresponding decision problems parameterized by the 
class of schema and, when appropriate, by a class of queries. 

Schema satisfiability - checking if there exists a tree
satisfying the given schema:

$\mathrm{SAT}_{\mathcal{S}} = \{S \in \mathcal{S} \mid \exists t \in Tree.\ t \models S\}$.

Membership - checking if the given tree satisfies the given
schema:

$\mathrm{MEMB}_{\mathcal{S}} = \{(S, t) \in \mathcal{S} \times Tree \mid t \models S\}$.

Schema containment - checking if every tree satisfying
one given schema satisfies another given schema:

$\mathrm{CNT}_{\mathcal{S}} = \{(S_1, S_2) \in \mathcal{S} \times \mathcal{S} \mid L(S_1) \subseteq L(S_2)\}$.

Query satisfiability by schema - checking if there exists
a tree that satisfies the given schema and the given query:

$\mathrm{SAT}_{\mathcal{S},\mathcal{Q}} = \{(S, q) \in \mathcal{S} \times \mathcal{Q} \mid \exists t \in L(S).\ t \models q\}$.

Query implication by schema - checking if every tree
satisfying the given schema satisfies also the given query:

$\mathrm{IMPL}_{\mathcal{S},\mathcal{Q}} = \{(S, q) \in \mathcal{S} \times \mathcal{Q} \mid \forall t \in L(S).\ t \models q\}$.

Query containment in the presence of schema - checking
if every tree satisfying the given schema and one given
query also satisfies another given query:

$\mathrm{CNT}_{\mathcal{S},\mathcal{Q}} = \{(p, q, S) \in \mathcal{Q} \times \mathcal{Q} \times \mathcal{S} \mid \forall t \in L(S).\ t \models p \Rightarrow t \models q\}$.

We next study these decision problems for DMS and MS.



4.1 Disjunctive multiplicity schema 

In this section we present the complexity results for DMS. 
Given a schema $S$, a dynamic programming algorithm can
check whether there exists a tree satisfying $S$. Moreover,
given a tree $t$ and a schema $S$, to check whether $t \models S$, a
simple algorithm counts, for each node of $t$, the number
of children carrying each label in $\Sigma$ and then verifies
that those numbers are consistent with the corresponding rule
from $S$.

Proposition 4.1 $\mathrm{SAT}_{DMS}$ and $\mathrm{MEMB}_{DMS}$ are in PTIME.
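
One possible way to realize the satisfiability check is a least-fixpoint computation over the labels; the paper only states that a dynamic programming algorithm exists, so the sketch below is an assumption about its shape rather than the authors' algorithm. A label is called productive here if some finite tree rooted at it satisfies the rules; the encoding of rules as lists of `(atoms, multiplicity)` pairs and the names `productive_labels` and `satisfiable` are hypothetical.

```python
NULLABLE = {"*", "?", "0"}  # multiplicities whose interpretation contains 0

def productive_labels(rules):
    """Least fixpoint of the labels that can be the root of some finite tree.

    rules maps a label to a list of (atoms, multiplicity) pairs, where atoms is
    a list of (symbol, multiplicity) pairs; a missing rule stands for label -> epsilon.
    """
    labels = set(rules)
    for expr in rules.values():
        for atoms, _ in expr:
            labels.update(symbol for symbol, _ in atoms)
    productive = set()
    changed = True
    while changed:
        changed = False
        for label in labels - productive:
            expr = rules.get(label, [])
            # Every disjunction D^M must be realizable: either M admits 0 copies,
            # or some alternative is optional or already known to be productive.
            if all(
                mult in NULLABLE
                or any(m in NULLABLE or s in productive for s, m in atoms)
                for atoms, mult in expr
            ):
                productive.add(label)
                changed = True
    return productive

def satisfiable(root, rules):
    """A schema is satisfiable iff its root label is productive."""
    return root in productive_labels({**rules, root: rules.get(root, [])})

# r -> a || b?, a -> r+ : every r needs an a and every a needs an r, so no finite tree exists.
print(satisfiable("r", {"r": [([("a", "1")], "1"), ([("b", "1")], "?")],
                        "a": [([("r", "1")], "+")]}))
# r -> (a | b), a -> a, b -> epsilon : the alternative b yields a finite tree.
print(satisfiable("r", {"r": [([("a", "1"), ("b", "1")], "1")],
                        "a": [([("a", "1")], "1")]}))
```

Each pass of the loop either adds a label or terminates, so the number of passes is bounded by the size of the alphabet, which keeps the sketch polynomial.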

Testing the containment of two DMS reduces to testing, for
each symbol in the alphabet, the containment of the associated
disjunctive multiplicity expressions. We propose an
alternative definition of the language of a disjunctive
multiplicity expression using three characteristic sets. Then, we
show that the inclusion of two disjunctive multiplicity
expressions is equivalent to the inclusion of the characteristic
sets. Recall that $a \in w$ means that $w(a) \neq 0$. For a disjunctive
multiplicity expression $E$ we define:

• The conflicting pairs of siblings $C_E$ consisting of pairs
of symbols in $\Sigma$ such that $E$ defines no word using both
symbols simultaneously:

$C_E = \{(a_1, a_2) \in \Sigma \times \Sigma \mid \nexists w \in L(E).\ a_1 \in w \wedge a_2 \in w\}$.

• The extended cardinality map $N_E$ captures for each
symbol in the alphabet the possible numbers of its
occurrences in the unordered words defined by $E$:

$N_E = \{(a, w(a)) \in \Sigma \times \mathbb{N} \mid w \in L(E)\}$.

• The sets of required symbols $P_E$ which captures symbols
that must be present in every word; essentially, a
set of symbols $X$ belongs to $P_E$ if every word defined
by $E$ contains at least one element from $X$:

$P_E = \{X \subseteq \Sigma \mid \forall w \in L(E).\ \exists a \in X.\ a \in w\}$.

As an example we take $E_0 = a^+ \| (b \mid c) \| d^?$. Because $P_E$ is
closed under supersets, we list only its minimal elements:

$C_{E_0} = \{(b, c), (c, b)\}$, $P_{E_0} = \{\{a\}, \{b, c\}, \ldots\}$,
$N_{E_0} = \{(b, 0), (b, 1), (c, 0), (c, 1), (d, 0), (d, 1), (a, 1), (a, 2), \ldots\}$.

The characteristic sets allow us to capture the containment
of disjunctive multiplicity expressions:

Lemma 4.2 Given two disjunctive multiplicity expressions
$E_1$ and $E_2$, $L(E_2) \subseteq L(E_1)$ iff $C_{E_1} \subseteq C_{E_2}$, $N_{E_2} \subseteq N_{E_1}$,
and $P_{E_1} \subseteq P_{E_2}$.

We point out that $N_E$ may be infinite but it can be represented
in a compact manner using multiplicities. Also, $P_E$
may be exponential in the size of $E$ but it can be represented
with its $\subseteq$-minimal elements. Both representations
allow easy use of the lemma above to test the inclusion.
Consequently,

Theorem 4.3 $\mathrm{CNT}_{DMS}$ is in PTIME.



Query satisfiability for DTDs is known to be NP-complete
and the proof from [4] can be easily adapted to DMS.

Proposition 4.4 $\mathrm{SAT}_{DMS,Twig}$ is NP-complete.

The complexity results for query implication and query containment
in the presence of DMS follow from the
EXPTIME-completeness proof from [17] for twig query containment in
the presence of DTDs.

Proposition 4.5 $\mathrm{IMPL}_{DMS,Twig}$ and $\mathrm{CNT}_{DMS,Twig}$ are
EXPTIME-complete.

4.2 Disjunction-free multiplicity schema 

In this section we present the complexity results for MS.
Although query satisfiability and query implication are
intractable for DMS, these problems become tractable for MS
because they can be reduced to testing embedding of queries
in some dependency graphs that we define in the sequel.
Recall that MS use expressions of the form $a_1^{M_1} \| \ldots \| a_k^{M_k}$.

Definition 4.6 Given an MS $S = (root_S, R_S)$, the dependency
graph of $S$ is a directed rooted graph $G_S = (\Sigma, root_S, E_S)$
with the node set $\Sigma$, where $root_S$ is the distinguished root
node and $(a, b) \in E_S$ if $R_S(a) = \ldots \| b^M \| \ldots$ and
$M \in \{*, +, ?, 1\}$. Furthermore, the edge $(a, b)$ is called nullable
if $0 \in [\![M]\!]$ (i.e., $M$ is $*$ or $?$), otherwise $(a, b)$ is called
non-nullable (i.e., $M$ is $+$ or $1$). The universal dependency
graph of an MS $S$ is the subgraph $G_S^u$ containing only the
non-nullable edges.

In Figure 4 we present the dependency graphs for the schema
$S_5$ containing the rules $r \to a^+ \| b^*$, $a \to b^?$, $b \to \varepsilon$.

Figure 4: Dependency graph $G_{S_5}$ and universal dependency
graph $G_{S_5}^u$ for schema $S_5$.

An MS $S$ is pruned if its universal dependency graph $G_S^u$ is acyclic. We observe that any MS
has an equivalent pruned version which can be constructed 
in PTIME by removing the rules for the labels from which 
a cycle can be reached in the universal dependency graph. 
Note that a schema is satisfiable iff no cycle can be reached 
from its root in the universal dependency graph. From now 
on, we assume w.l.o.g. that all the MS that we manipulate 
are pruned. 
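
The following sketch builds both graphs for the schema $S_5$ used above and tests satisfiability by looking for a cycle reachable from the root in the universal dependency graph. The dictionary encoding of an MS and the function names `dependency_graph` and `ms_satisfiable` are assumptions made for this illustration.

```python
def dependency_graph(rules, universal=False):
    """Edges (a, b) for every b^M in the rule of a with M != 0.

    With universal=True only the non-nullable edges (M is + or 1) are kept.
    rules maps a label to a dict {child symbol: multiplicity}.
    """
    allowed = {"+", "1"} if universal else {"*", "+", "?", "1"}
    return {a: {b for b, m in expr.items() if m in allowed} for a, expr in rules.items()}

def ms_satisfiable(root, rules):
    """True iff no cycle is reachable from root in the universal dependency graph."""
    graph = dependency_graph(rules, universal=True)
    state = {}  # label -> "open" (on the current DFS path) or "done"

    def reaches_cycle(label):
        if state.get(label) == "open":
            return True          # we came back to a label on the current path
        if state.get(label) == "done":
            return False
        state[label] = "open"
        found = any(reaches_cycle(child) for child in graph.get(label, ()))
        state[label] = "done"
        return found

    return not reaches_cycle(root)

# Schema S5: r -> a+ || b*, a -> b?, b -> epsilon.
S5 = {"r": {"a": "+", "b": "*"}, "a": {"b": "?"}, "b": {}}
print(dependency_graph(S5))                  # edges r->a, r->b, a->b
print(dependency_graph(S5, universal=True))  # only the non-nullable edge r->a
print(ms_satisfiable("r", S5))
```

A cycle that uses a nullable edge does not make the schema unsatisfiable, because the optional child can simply be omitted; this is why only the universal graph is inspected.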

We generalize the notion of embedding as a mapping of the
nodes of a query $q$ to the nodes of a rooted graph
$G = (\Sigma, root, E)$, which can be either a dependency graph or a
universal dependency graph. Formally, an embedding of $q$ in
$G$ is a function $\lambda : N_q \to \Sigma$ such that:

1. $\lambda(root_q) = root$,

2. for every $(n, n') \in child_q$, $(\lambda(n), \lambda(n')) \in E$,

3. for every $(n, n') \in desc_q$, $(\lambda(n), \lambda(n')) \in E^+$ (the
transitive closure of $E$),

4. for every $n \in N_q$, $lab_q(n) = \star$ or $lab_q(n) = \lambda(n)$.

If there exists an embedding from $q$ to $G$, we write $G \preccurlyeq q$.
The dependency graphs and embeddings capture satisfiability
and implication of queries by MS.

Lemma 4.7 For a twig query $q$ and an MS $S$ we have: 1) $q$
is satisfiable by $S$ iff $G_S \preccurlyeq q$, 2) $q$ is implied by $S$ iff $G_S^u \preccurlyeq q$.

Furthermore, testing the embedding of a query in a graph 
can be done in polynomial time with a simple bottom-up 
algorithm. Hence, 

Theorem 4.8 $\mathrm{SAT}_{MS,Twig}$ and $\mathrm{IMPL}_{MS,Twig}$ are in PTIME.
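
A bottom-up embedding test of this kind can be sketched as follows. For every query node it computes the set of graph nodes onto which the subquery rooted there can be mapped; because the query is a tree and embeddings need not be injective, the branches can be handled independently. The nested-tuple encoding of queries and the names `reachable` and `embeds` are assumptions of this example, not the paper's notation.

```python
def reachable(graph, start):
    """Nodes reachable from start by one or more edges (transitive closure)."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen

def embeds(query, graph, root):
    """Check whether the twig query can be embedded in the rooted graph.

    query is a pair (label, branches), where branches is a list of (axis, subquery)
    with axis "/" (child) or "//" (descendant); the label "*" is the wildcard.
    graph maps a node to the set of its successors.
    """
    nodes = set(graph) | {v for succ in graph.values() for v in succ} | {root}
    child = {v: set(graph.get(v, ())) for v in nodes}
    desc = {v: reachable(graph, v) for v in nodes}

    def images(q):
        """Graph nodes v such that the subquery q embeds with its root mapped to v."""
        label, branches = q
        candidates = {v for v in nodes if label in ("*", v)}
        for axis, sub in branches:
            sub_images = images(sub)
            step = child if axis == "/" else desc
            candidates = {v for v in candidates if step[v] & sub_images}
        return candidates

    return root in images(query)

# Graphs of the schema S5: r -> a+ || b*, a -> b?, b -> epsilon.
G = {"r": {"a", "b"}, "a": {"b"}, "b": set()}   # dependency graph
GU = {"r": {"a"}, "a": set(), "b": set()}       # universal dependency graph
r_a_b = ("r", [("/", ("a", [("/", ("b", []))]))])   # the query r/a/b
r_a = ("r", [("/", ("a", []))])                     # the query r/a
print(embeds(r_a_b, G, "r"), embeds(r_a_b, GU, "r"))
print(embeds(r_a, GU, "r"))
```

Read through Lemma 4.7, the first line says that r/a/b is satisfiable but not implied by the schema, and the second that r/a is implied.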

The intractability of the containment of twig queries [15]
implies the coNP-hardness of the containment of twig queries
in the presence of MS. Proving the membership of the problem
to coNP is, however, not trivial. Given an instance
$(p, q, S)$, the set of all the trees satisfying $p$ and $S$ can be
characterized with a set $\mathcal{G}(p, S)$ containing an exponential
number of polynomially-sized graphs and $p$ is contained in
$q$ in the presence of $S$ iff the query $q$ can be embedded into
all the graphs in $\mathcal{G}(p, S)$. This condition is easily checked by
a non-deterministic Turing machine.

Theorem 4.9 $\mathrm{CNT}_{MS,Twig}$ is coNP-complete.

We also point out that the results are easily adapted to 
disjunction-free DTDs, which allows us to state results which, 
to the best of our knowledge, are novel. 

Corollary 4.10 $\mathrm{IMPL}_{disj\text{-}free\text{-}DTD,Twig}$ is in PTIME and
$\mathrm{CNT}_{disj\text{-}free\text{-}DTD,Twig}$ is coNP-complete.

5. EXPRESSIVENESS OF DMS 

We compare the expressive power of DMS and DTDs and 
focus on real-life applications. First, we introduce a sim- 
ple tool for comparing regular expressions with disjunctive 
multiplicity expressions, and by extension, DTDs with DMS. 
For a regular expression R, the language L{R) of unordered 
words is obtained by removing the relative order of symbols 
from every ordered word defined by R. A disjunctive multi- 
plicity expression E captures R if L{E) = L(R). A DMS S 
captures a DTD D if for every symbol the disjunctive multiplicity
expression on the rhs of a rule in S captures the
regular expression on the rhs of the corresponding rule in 
D. We believe that this simple comparison is adequate be- 
cause if a DTD is to be used in a data-centric application, 
then supposedly the order between siblings is not impor- 
tant. Therefore, a DMS that captures a given DTD defines 
basically the same type of admissible documents, without 
imposing any order among siblings. 

We use the comparison on the XMark [18] benchmark and 
the University of Amsterdam XML Web Collection [11] . We 
find that all 77 regular expressions of the XMark benchmark 
are captured by DMS rules, and among them 76 by MS 
rules. As for the DTDs found in the University of Amster- 
dam XML Web Collection, 84% of regular expressions (with 
repetitions discarded) are captured by DMS rules and among 
them 74.6% by MS rules. Moreover, 55.5% of full DTDs in 



the collection are captured by DMS and among them 45.8% 
by MS. Note that these figures should be interpreted with 
caution, as we do not know which of the considered DTDs 
were indeed intended for data-centric applications. We be- 
lieve, however, that these numbers give a generally positive 
answer to the question of how much of the expressive power 
of DTDs the proposed schema formalisms, DMS and MS, 
retain. 

6. CONCLUSIONS AND FUTURE WORK 

We have studied the computational properties and the ex- 
pressive power of new schema formalisms, designed for un- 
ordered XML: the disjunctive multiplicity schema (DMS) 
and its restriction, the disjunction-free multiplicity schema 
(MS). DMS and MS can be seen as DTDs using restricted 
classes of regular expressions and interpreted under commu- 
tative closure to define unordered content models. These 
restrictions allow on the one hand to maintain a relatively 
low computational complexity of basic static analysis prob- 
lems while retaining a significant part of expressive power of 
DTDs. 

An interesting question remains open: are these the most 
general restrictions that allow to maintain a low complexity 
profile? We believe that the answer to this question is nega- 
tive and intend to identify new practical features that could 
be added to DMS and MS. One such feature is numeric
occurrences [12] of the form $a^{[n,m]}$ that generalize multiplicities
by requiring the presence of at least n and no more than
m elements a. It would also be interesting to see to what 
extent our results can be used to propose hybrid schemas 
that allow to define ordered content for some elements and 
unordered model for others. 
APPENDIX

A. DETAILED PROOFS 

A.1 Disjunctive Multiplicity Schema

We first introduce the notion of normal form for disjunctive
multiplicity expressions, then we prove two additional lemmas
related to the alternative definition with characteristic
sets, and, finally, we prove the complexity results stated for
DMS.

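Disjunctive multiplicity expressions in normal form
Given a disjunctive multiplicity expression $E = D_1^{M_1} \| \ldots \| D_m^{M_m}$,
we denote by $\Sigma_{D_i}$ the set of symbols used in the
disjunction from $D_i$ and by $M^a$ the multiplicity corresponding
to a symbol $a$. We say that $E$ is in normal form if the
following two conditions are satisfied:

$\forall i.\ 1 \leq i \leq m.\ M_i \neq 1 \Rightarrow M_i = + \wedge \forall a \in \Sigma_{D_i}.\ M^a = 1$

$\forall i.\ 1 \leq i \leq m.\ \exists a \in \Sigma_{D_i}.\ 0 \in [\![M^a]\!] \Rightarrow \forall a' \in \Sigma_{D_i}.\ 0 \in [\![M^{a'}]\!]$

Any expression of the form $D^M$ can be rewritten as an equivalent
disjunctive multiplicity expression in normal form using
the following rules:

• $(a_1^{M_1} \mid \ldots \mid a_n^{M_n})^*$ goes to $a_1^* \| \ldots \| a_n^*$

• $(a_1^{M_1} \mid \ldots \mid a_n^{M_n})^?$ goes to $(a_1^{M'_1} \mid \ldots \mid a_n^{M'_n})$, where
$\forall i.\ 1 \leq i \leq n.\ [\![M'_i]\!] = \{0\} \cup [\![M_i]\!]$

• $(a_1^{M_1} \mid \ldots \mid a_n^{M_n})^1$, where $\exists i.\ 1 \leq i \leq n.\ 0 \in [\![M_i]\!]$ goes
to $(a_1^{M'_1} \mid \ldots \mid a_n^{M'_n})$, where $\forall i.\ 1 \leq i \leq n.\ [\![M'_i]\!] = \{0\} \cup [\![M_i]\!]$

• $(a_1^{M_1} \mid \ldots \mid a_n^{M_n})^+$, where $\exists i.\ 1 \leq i \leq n.\ 0 \in [\![M_i]\!]$ goes
to $a_1^* \| \ldots \| a_n^*$

• $(a_1^{M_1} \mid \ldots \mid a_n^{M_n})^+$, where $\forall i.\ 1 \leq i \leq n.\ 0 \notin [\![M_i]\!]$ goes
to $(a_1 \mid \ldots \mid a_n)^+$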

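Alternative definition with characteristic sets
An unordered word $w$ is consistent with a tuple $(C_E, N_E, P_E)$
corresponding to a disjunctive multiplicity expression $E$,
denoted $w \models (C_E, N_E, P_E)$, iff $w$ is consistent with $C_E$, $N_E$,
and $P_E$, respectively. Formally:

$w \models C_E \Leftrightarrow \forall (a_1, a_2) \in C_E.\ (a_1 \in w \Rightarrow a_2 \notin w) \wedge (a_2 \in w \Rightarrow a_1 \notin w)$
$w \models N_E \Leftrightarrow \forall a \in \Sigma.\ (a, w(a)) \in N_E$
$w \models P_E \Leftrightarrow \forall X \in P_E.\ \exists a \in X.\ a \in w$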



We use the characteristic sets to give an alternative defini- 
tion of the membership of an unordered word to the language 
of a disjunctive multiplicity expression. 

Lemma A.1 An unordered word $w$ belongs to the language
of a disjunctive multiplicity expression $E$ iff it is consistent
with the tuple $(C_E, N_E, P_E)$.

Proof. For the if part, consider the tuple $(C_E, N_E, P_E)$
corresponding to a disjunctive multiplicity expression
$E = D_1^{M_1} \parallel \dots \parallel D_m^{M_m}$, and an unordered
word $w$ such that $w \models (C_E, N_E, P_E)$. We want to prove that
$w \models E$, which by definition means that:

$$\exists w_1, \dots, w_m.\ w = w_1 \uplus \dots \uplus w_m \wedge \forall i.\ 1 \le i \le m.\ w_i \models D_i^{M_i}$$

Assuming that $E$ is in normal form, there are three cases for
$D_i^{M_i}$, for every $i$ such that $1 \le i \le m$:

1. $D_i^{M_i} = (a_1 \mid \dots \mid a_n)^{+}$. In this case
$w_i \models D_i^{M_i}$ is equivalent to
$\exists j.\ 1 \le j \le n.\ a_j \in w_i$. Because $w$ satisfies $P_E$
and $\{a_1, \dots, a_n\} \in P_E$, we infer that
$w_i \models D_i^{M_i}$.

2. $D_i^{M_i} = (a_1^{M_1} \mid \dots \mid a_n^{M_n})$ and
$\forall j.\ 1 \le j \le n.\ 0 \notin \llbracket M_j \rrbracket$. In
this case $w_i \models D_i^{M_i}$ is equivalent to:

$$\exists j.\ 1 \le j \le n.\ w_i(a_j) \in \llbracket M_j \rrbracket \wedge \forall l.\ 1 \le l \le n.\ (l \ne j \Rightarrow a_l \notin w_i)$$

We know that $\{a_1, \dots, a_n\} \in P_E$, so
$\exists j.\ 1 \le j \le n.\ a_j \in w_i$, and because the only
$x \in \mathbb{N}$ such that $(a_j, x) \in N_E$ are from
$\{0\} \cup \llbracket M_j \rrbracket$, we infer that
$(a_j, w_i(a_j)) \in N_E$.

We also know that
$(\forall j, l \in \{1, \dots, n\}.\ j \ne l \Rightarrow (a_j, a_l) \in C_E)$,
which implies that
$(\forall j, l \in \{1, \dots, n\}.\ j \ne l \wedge a_j \in w_i \Rightarrow a_l \notin w_i)$.
From the last two relations we infer that $w_i \models D_i^{M_i}$.

3. $D_i^{M_i} = (a_1^{M_1} \mid \dots \mid a_n^{M_n})$ and
$\forall j.\ 1 \le j \le n.\ 0 \in \llbracket M_j \rrbracket$.
In this case $w_i \models D_i^{M_i}$ is equivalent to:
$(\exists j.\ 1 \le j \le n.\ w_i(a_j) \in \llbracket M_j \rrbracket \setminus \{0\})$
implies
$(\forall l.\ 1 \le l \le n.\ l \ne j \Rightarrow a_l \notin w_i)$.
The reasoning is similar to the previous case; the only difference
is that now $\{a_1, \dots, a_n\} \notin P_E$, so we obtain
$w_i \models D_i^{M_i}$ even if none of the $a_j$ is present in $w_i$.

From the three cases presented above we conclude that
$w \models (C_E, N_E, P_E) \Rightarrow w \models E$.



For the only if part, consider a disjunctive multiplicity
expression $E = D_1^{M_1} \parallel \dots \parallel D_m^{M_m}$ and an
unordered word $w$ which belongs to the language of $E$. By
definition, this means that:

$$\exists w_1, \dots, w_m.\ w = w_1 \uplus \dots \uplus w_m \wedge \forall i.\ 1 \le i \le m.\ w_i \models D_i^{M_i}$$

Assuming that $E$ is in normal form, there are three cases for
$D_i^{M_i}$, for every $i$ such that $1 \le i \le m$:

1. $D_i^{M_i} = (a_1 \mid \dots \mid a_n)^{+}$. In this case
$w_i \models D_i^{M_i}$ is equivalent to
$\exists j.\ 1 \le j \le n.\ a_j \in w_i$, so
$\{a_1, \dots, a_n\} \in P_E$ is satisfied. There are no conflicting
pairs of symbols in $\{a_1, \dots, a_n\}$. As
$\forall j.\ 1 \le j \le n.\ \forall x \in \mathbb{N}.\ (a_j, x) \in N_E$,
we obtain that in $w_i$ all the symbols have consistent
cardinalities w.r.t. $N_E$.

2. $D_i^{M_i} = (a_1^{M_1} \mid \dots \mid a_n^{M_n})$ and
$\forall j.\ 1 \le j \le n.\ 0 \notin \llbracket M_j \rrbracket$.
In this case $w_i \models D_i^{M_i}$ is equivalent to
$\exists j.\ 1 \le j \le n.\ w_i(a_j) \in \llbracket M_j \rrbracket \wedge \forall l.\ 1 \le l \le n.\ (l \ne j \Rightarrow a_l \notin w_i)$,
which implies that $\{a_1, \dots, a_n\} \in P_E$ is satisfied.
The conflicting pairs of symbols are also satisfied: we know from
the definition of $C_E$ that
$(\forall j, l \in \{1, \dots, n\}.\ j \ne l \Rightarrow (a_j, a_l) \in C_E)$
and $w_i \models D_i^{M_i}$ implies that
$(\forall j, l \in \{1, \dots, n\}.\ j \ne l \wedge a_j \in w_i \Rightarrow a_l \notin w_i)$,
so there are no conflicts in $w_i$.
Regarding the extended cardinality map, we know that
$\forall j.\ 1 \le j \le n.\ ((a_j, 0) \in N_E \wedge \forall x \in \llbracket M_j \rrbracket.\ (a_j, x) \in N_E)$,
which is satisfied for the present symbol (since
$w_i \models D_i^{M_i}$) and also for the symbols which are not
present (since 0 belongs to their extended cardinality map).

3. $D_i^{M_i} = (a_1^{M_1} \mid \dots \mid a_n^{M_n})$ and
$\forall j.\ 1 \le j \le n.\ 0 \in \llbracket M_j \rrbracket$. In
this case the reasoning for $C_E$ and $N_E$ is identical to
the previous case. The difference is that now $P_E$ is less
restrictive, since $\{a_1, \dots, a_n\} \notin P_E$.

From the three cases presented above we can conclude that
$w \models E \Rightarrow w \models (C_E, N_E, P_E)$. □

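Lemma A.2 Given two disjunctive multiplicity expressions
$E_1$ and $E_2$:
$(C_{E_1} \subseteq C_{E_2} \wedge N_{E_2} \subseteq N_{E_1} \wedge P_{E_1} \subseteq P_{E_2})$
iff
$(\forall w.\ w \models (C_{E_2}, N_{E_2}, P_{E_2}) \Rightarrow w \models (C_{E_1}, N_{E_1}, P_{E_1}))$.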

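Proof. For the if part, we prove by contraposition:

• $C_{E_1} \nsubseteq C_{E_2} \Rightarrow \exists (a_1, a_2) \in C_{E_1}.\ (a_1, a_2) \notin C_{E_2} \Rightarrow \exists (a_1, a_2) \in \Sigma \times \Sigma.\ (\nexists w \in \mathcal{L}(E_1).\ a_1 \in w \wedge a_2 \in w) \wedge (\exists w' \in \mathcal{L}(E_2).\ a_1 \in w' \wedge a_2 \in w') \Rightarrow (\exists w'.\ w' \models C_{E_2} \wedge w' \not\models C_{E_1})$

• $N_{E_2} \nsubseteq N_{E_1} \Rightarrow \exists a \in \Sigma.\ \exists w \in \mathcal{L}(E_2).\ \nexists w' \in \mathcal{L}(E_1).\ w'(a) = w(a) \Rightarrow (\exists w.\ w \models N_{E_2} \wedge w \not\models N_{E_1})$

• $P_{E_1} \nsubseteq P_{E_2} \Rightarrow \exists X \subseteq \Sigma.\ (\forall w \in \mathcal{L}(E_1).\ \exists a \in X.\ a \in w) \wedge (\exists w' \in \mathcal{L}(E_2).\ \forall a \in X.\ a \notin w') \Rightarrow (\exists w'.\ w' \models P_{E_2} \wedge w' \not\models P_{E_1})$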

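For the only if part, we take an unordered word $w$ such
that $w \models (C_{E_2}, N_{E_2}, P_{E_2})$ and we want to prove that
$w \models (C_{E_1}, N_{E_1}, P_{E_1})$, assuming that
$C_{E_1} \subseteq C_{E_2}$, $N_{E_2} \subseteq N_{E_1}$, and
$P_{E_1} \subseteq P_{E_2}$. If we consider that
$E_1 = D_{11}^{M_{11}} \parallel \dots \parallel D_{1m}^{M_{1m}}$,
proving that $w \models (C_{E_1}, N_{E_1}, P_{E_1})$ is equivalent
(according to the previous lemma and the definition) to proving that:

$$\exists w_1, \dots, w_m.\ w = w_1 \uplus \dots \uplus w_m \wedge \forall i.\ 1 \le i \le m.\ w_i \models D_{1i}^{M_{1i}}$$

Assuming that $E_1$ is in normal form, there are three cases
for $D_{1i}^{M_{1i}}$, for every $i$ such that $1 \le i \le m$:

1. $D_{1i}^{M_{1i}} = (a_1 \mid \dots \mid a_n)^{+}$. In this case
$w_i \models D_{1i}^{M_{1i}}$ is equivalent to
$\exists j.\ 1 \le j \le n.\ a_j \in w_i$. The form of
$D_{1i}^{M_{1i}}$ implies that $\{a_1, \dots, a_n\} \in P_{E_1}$ and
since $P_{E_1} \subseteq P_{E_2}$, we infer that
$\{a_1, \dots, a_n\} \in P_{E_2}$. Because $w \models P_{E_2}$, then
$\exists j.\ 1 \le j \le n.\ a_j \in w$. Since the labels
in $\{a_1, \dots, a_n\}$ appear only in $D_{1i}^{M_{1i}}$, we conclude
that $\exists j.\ 1 \le j \le n.\ a_j \in w_i$, so
$w_i \models D_{1i}^{M_{1i}}$.

2. $D_{1i}^{M_{1i}} = (a_1^{M_1} \mid \dots \mid a_n^{M_n})$ and
$\forall j.\ 1 \le j \le n.\ 0 \notin \llbracket M_j \rrbracket$. In
this case $w_i \models D_{1i}^{M_{1i}}$ is equivalent to:

$$\exists j.\ 1 \le j \le n.\ w_i(a_j) \in \llbracket M_j \rrbracket \wedge \forall l.\ 1 \le l \le n.\ (l \ne j \Rightarrow a_l \notin w_i)$$

The form of $D_{1i}^{M_{1i}}$ implies that
$\{a_1, \dots, a_n\} \in P_{E_1}$ and since
$P_{E_1} \subseteq P_{E_2}$, we infer that
$\{a_1, \dots, a_n\} \in P_{E_2}$. Because $w \models P_{E_2}$, then
$\exists j.\ 1 \le j \le n.\ a_j \in w$ and as the labels in
$\{a_1, \dots, a_n\}$ appear only in $D_{1i}^{M_{1i}}$, this implies
$\exists j.\ 1 \le j \le n.\ a_j \in w_i$. In the sequel we denote
this present label with $a_j$.

Because $w \models N_{E_2}$, we have $(a_j, w_i(a_j)) \in N_{E_2}$,
which is included in $N_{E_1}$, so
$w_i(a_j) \in \llbracket M_j \rrbracket$. Moreover, we know
that $\{(a_j, a_l) \mid 1 \le l \le n \wedge j \ne l\} \subseteq C_{E_1} \subseteq C_{E_2}$,
and because $w \models C_{E_2}$ we conclude
$\exists j.\ 1 \le j \le n.\ w_i(a_j) \in \llbracket M_j \rrbracket \wedge \forall l.\ 1 \le l \le n.\ (l \ne j \Rightarrow a_l \notin w_i)$,
so $w_i \models D_{1i}^{M_{1i}}$.

3. $D_{1i}^{M_{1i}} = (a_1^{M_1} \mid \dots \mid a_n^{M_n})$ and
$\forall j.\ 1 \le j \le n.\ 0 \in \llbracket M_j \rrbracket$.
In this case $w_i \models D_{1i}^{M_{1i}}$ is equivalent to:
$(\exists j.\ 1 \le j \le n.\ w_i(a_j) \in \llbracket M_j \rrbracket \setminus \{0\})$
implies
$(\forall l.\ 1 \le l \le n.\ l \ne j \Rightarrow a_l \notin w_i)$
and we obtain this using the same reasoning
as in the previous case for $C_{E_1}$ and $N_{E_1}$.

From the three cases developed above we obtain that:
$(w \models (C_{E_2}, N_{E_2}, P_{E_2}) \wedge C_{E_1} \subseteq C_{E_2} \wedge N_{E_2} \subseteq N_{E_1} \wedge P_{E_1} \subseteq P_{E_2})$
implies $(w \models (C_{E_1}, N_{E_1}, P_{E_1}))$. □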

Proof of Lemma 4.2

Lemma 4.2 follows directly from Lemmas A.1 and A.2. □

Proof of Theorem 4.3

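Given two DMS $S_1$ and $S_2$, $S_2 \subseteq S_1$ iff both schemas
have the same root and for every symbol $a \in \Sigma$ we have
$\mathcal{L}(R_{S_2}(a)) \subseteq \mathcal{L}(R_{S_1}(a))$. From
Lemma 4.2 we know that:

$$E_2 \subseteq E_1 \iff (C_{E_1} \subseteq C_{E_2} \wedge N_{E_2} \subseteq N_{E_1} \wedge P_{E_1} \subseteq P_{E_2})$$

We have proved that testing containment of two disjunctive
multiplicity expressions is equivalent to testing inclusion of
some characteristic sets and now we argue that we can manipulate
them in PTIME. For a disjunctive multiplicity expression
$E = D_1^{M_1} \parallel \dots \parallel D_m^{M_m}$, computing $C_E$
is in PTIME:

$$C_E = \{(a_1, a_2) \mid \exists i.\ 1 \le i \le m.\ a_1, a_2 \in \Sigma_{D_i} \wedge M_i = 1\} \cup \{(a_1, a_2) \mid \forall i.\ 1 \le i \le m.\ a_1 \notin \Sigma_{D_i}\}$$

Given a symbol $a \in \Sigma$, by $N_E^{*}(a)$ we denote the
multiplicity $M$ such that
$\llbracket M \rrbracket = \{x \in \mathbb{N} \mid (a, x) \in N_E\}$,
which is a compact representation for a potentially infinite set.
Testing $N_{E_1} \subseteq N_{E_2}$ is equivalent to testing
$\forall a \in \Sigma.\ N_{E_1}^{*}(a) \subseteq N_{E_2}^{*}(a)$.
For a symbol $a$ and a disjunctive multiplicity expression
$E = D_1^{M_1} \parallel \dots \parallel D_m^{M_m}$, computing
$N_E^{*}(a)$ is in PTIME, because for any $a \in \Sigma$:

$$N_E^{*}(a) = \begin{cases} 0 & \text{if } \forall i.\ 1 \le i \le m.\ a \notin \Sigma_{D_i} \\ M^a & \text{if } \exists i.\ 1 \le i \le m.\ \Sigma_{D_i} = \{a\} \\ ? & \text{if } \exists i.\ 1 \le i \le m.\ a \in \Sigma_{D_i} \wedge M_i = 1 \wedge M^a \in \{?, 1\} \\ * & \text{otherwise} \end{cases}$$

The size of $P_E$ may be exponential in the size of the alphabet,
so we are interested in defining the minimal sets of required
symbols in any unordered word consistent with $E$:

$$P_E^{\min} = \{X \in P_E \mid \nexists X' \in P_E.\ X' \subset X\}$$

For two disjunctive multiplicity expressions $E_1$ and $E_2$, we
have:

$$P_{E_2} \subseteq P_{E_1} \iff \forall X \in P_{E_2}^{\min}.\ \exists Y \in P_{E_1}^{\min}.\ Y \subseteq X$$

For a disjunctive multiplicity expression
$E = D_1^{M_1} \parallel \dots \parallel D_m^{M_m}$, computing
$P_E^{\min}$ is in PTIME:

$$P_E^{\min} = \{\Sigma_{D_i} \mid 1 \le i \le m \wedge \forall a \in \Sigma_{D_i}.\ 0 \notin \llbracket M^a \rrbracket\}$$

□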
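The polynomial-time test can be sketched as follows. The code is a hedged
illustration rather than the authors' algorithm, and it rests on several
assumptions that are not stated in the paper: an expression in normal form
is a list of pairs `(symbols, outer)`, where `symbols` maps each symbol of a
disjunction to its multiplicity and `outer` is `"1"` or `"+"`; every symbol
occurs in at most one disjunction; a single symbol carries its multiplicity
itself (with `outer == "1"`); the multiplicity 0 is not used inside
expressions; and conflicting pairs are treated as unordered pairs of
distinct symbols. All function names are hypothetical.

```python
import math

# [[M]] represented as an interval (min, max) of occurrence counts.
INTERVAL = {"0": (0, 0), "1": (1, 1), "?": (0, 1),
            "+": (1, math.inf), "*": (0, math.inf)}


def mult_subset(m1, m2):
    """True iff [[m1]] is included in [[m2]]."""
    (lo1, hi1), (lo2, hi2) = INTERVAL[m1], INTERVAL[m2]
    return lo2 <= lo1 and hi1 <= hi2


def conflicts(expr, sigma):
    """C_E as unordered pairs of symbols that never occur together."""
    used = {a for syms, _ in expr for a in syms}
    pairs = {frozenset((a, b)) for syms, outer in expr if outer == "1"
             for a in syms for b in syms if a != b}
    # A symbol that does not occur in E conflicts with every symbol.
    pairs |= {frozenset((a, b)) for a in sigma - used for b in sigma}
    return pairs


def card(expr, a):
    """The compact multiplicity N*_E(a), following the four-case definition."""
    for syms, outer in expr:
        if a in syms:
            if len(syms) == 1:
                return syms[a]
            if outer == "1" and syms[a] in ("?", "1"):
                return "?"
            return "*"
    return "0"


def min_required(expr):
    """Minimal required sets: disjunctions with no nullable symbol."""
    return [set(syms) for syms, _ in expr
            if all(INTERVAL[m][0] > 0 for m in syms.values())]


def contained(e2, e1, sigma):
    """Test L(E2) <= L(E1) via the three characteristic-set inclusions."""
    return (conflicts(e1, sigma) <= conflicts(e2, sigma)
            and all(mult_subset(card(e2, a), card(e1, a)) for a in sigma)
            and all(any(y <= x for y in min_required(e2))
                    for x in min_required(e1)))
```

As a small example, take $E_1 = a^{*} \parallel (b \mid c)$ and
$E_2 = a^{?} \parallel b$ over $\Sigma = \{a, b, c\}$. Encoded as
`e1 = [({"a": "*"}, "1"), ({"b": "1", "c": "1"}, "1")]` and
`e2 = [({"a": "?"}, "1"), ({"b": "1"}, "1")]`, the call
`contained(e2, e1, {"a", "b", "c"})` succeeds, while the converse test fails
because $c$ can occur in a word of $E_1$ but never in a word of $E_2$.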

Proof of Proposition 4.4

The NP-hardness proof from Proposition 4.2.1 from [1] can
be easily adapted. The reduction from 3SAT to $SAT_{DMS,Twig}$
works as follows: we take a 3CNF formula
$\varphi = C_1 \wedge \dots \wedge C_n$ over the variables
$x_1, \dots, x_m$, where each $C_i$ is a disjunction of 3
literals. Consider
$\Sigma = \{r, t_1, f_1, \dots, t_m, f_m, C_1, \dots, C_n\}$ and
the corresponding tuple $(S, q)$:

• The schema $S$ having the root label $r$ and the rules:

  - $r \to (t_1 \mid f_1) \parallel \dots \parallel (t_m \mid f_m)$

  - $t_j \to C_{j_1} \parallel \dots \parallel C_{j_k}$, for
    $1 \le j \le m$, where $x_j$ appears in $C_{j_1}, \dots, C_{j_k}$

  - $f_j \to C_{j_1} \parallel \dots \parallel C_{j_k}$, for
    $1 \le j \le m$, where $\neg x_j$ appears in $C_{j_1}, \dots, C_{j_k}$

• The query $q = r[//C_1] \dots [//C_n]$

For example, for the 3CNF formula over the variables $x_1, \dots, x_4$:
$\varphi_0 = (x_1 \vee \neg x_2 \vee x_3) \wedge (\neg x_1 \vee x_3 \vee \neg x_4)$
we have the schema $S$ containing the rules:

$r \to (t_1 \mid f_1) \parallel (t_2 \mid f_2) \parallel (t_3 \mid f_3) \parallel (t_4 \mid f_4)$

$t_1 \to C_1$        $f_1 \to C_2$
$t_2 \to \epsilon$   $f_2 \to C_1$
$t_3 \to C_1 \parallel C_2$   $f_3 \to \epsilon$
$t_4 \to \epsilon$   $f_4 \to C_2$

and the query:

$q = r[//C_1][//C_2]$

The formula $\varphi$ is satisfiable iff $(S, q) \in SAT_{DMS,Twig}$. The
described reduction works in polynomial time in the size of
the input formula $\varphi$. Moreover, the NP upper bound for
$SAT_{DMS,Twig}$ follows from Theorem 4.4 from W. □
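The reduction is mechanical enough to be sketched in code. The following
Python function is only an illustration of the construction described in
this proof; its name, the DIMACS-style clause encoding (the integer `3`
stands for $x_3$ and `-3` for $\neg x_3$), and the textual rendering of
rules and query are assumptions made for the example.

```python
def reduction(clauses, num_vars):
    """Build the DMS rules and the twig query for a CNF formula.

    clauses: list of lists of non-zero integers (DIMACS-style literals).
    Returns (rules, query), where rules maps each label to its rule body.
    """
    def rule(literal):
        # Clauses satisfied by this literal become required children.
        hits = [f"C{i}" for i, c in enumerate(clauses, 1) if literal in c]
        return " || ".join(hits) if hits else "epsilon"

    variables = range(1, num_vars + 1)
    # The root picks exactly one of t_j, f_j for every variable x_j.
    rules = {"r": " || ".join(f"(t{j} | f{j})" for j in variables)}
    for j in variables:
        rules[f"t{j}"] = rule(j)    # clauses made true by x_j = true
        rules[f"f{j}"] = rule(-j)   # clauses made true by x_j = false
    # The query asks for every clause label somewhere below the root.
    query = "r" + "".join(f"[//C{i}]" for i in range(1, len(clauses) + 1))
    return rules, query
```

Calling `reduction([[1, -2, 3], [-1, 3, -4]], 4)` reproduces the rules and
the query of the example formula $\varphi_0$, with `"epsilon"` standing for
the empty content model.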

Proof of Proposition 4.5

Theorem 4.4 from [17] implies that the problems $IMPL_{DTD,Twig}$,
$IMPL_{DMS,Twig}$, and $CNT_{DMS,Twig}$ are also in EXPTIME.
The EXPTIME-hardness proof of twig containment in the
presence of DTDs (Theorem 4.5 from [17]) has been done using
a reduction from the two-player corridor tiling problem and
a technique introduced in [15]. In the proof from [17], when
testing inclusion $p \subseteq_S q$, $p$ is chosen such that every
tree in the language of $S$ satisfies it, hence $IMPL_{DTD,Twig}$ is
also EXPTIME-complete.
Furthermore, Lemma 3 in [15] can be adapted to twig queries
and DMS: for any $S \in DMS$ and twig queries $q_0, q_1, \dots, q_m$
there exists $S' \in DMS$ and twig queries $q$ and $q'$ such that

$$q_0 \subseteq_S q_1 \cup \dots \cup q_m \iff q \subseteq_{S'} q'.$$

Because the DTD in [17] can be captured with DMS, from
the last two statements we conclude that $IMPL_{DMS,Twig}$ and
$CNT_{DMS,Twig}$ are also EXPTIME-complete. □

A.2 Disjunction-free Multiplicity Schema 

We first present some of the technical tools which help us 
to reason about the disjunction-free multiplicity schemas. 
Next, we use these tools to prove our results. 



Graph simulation

A simulation of a rooted graph (either dependency graph or
universal dependency graph) $G = (\Sigma, root, E)$ in a tree $t$ is
a relation $R \subseteq \Sigma \times N_t$ such that:

1. $(root, root_t) \in R$

2. for every $(a, n) \in R$ and $(a, a') \in E$, there exists
$n' \in N_t$ such that $(n, n') \in child_t$ and $(a', n') \in R$

3. for every $(a, n) \in R$, $lab_t(n) = a$

Note that $R$ is a total relation for the nodes of the graph
reachable from the root i.e., for every $a \in \Sigma$ reachable from
$root$ in $G$, there exists a node $n \in N_t$ such that
$(a, n) \in R$. If there exists a simulation from $G$ to $t$, we
write $t \preccurlyeq G$. The language of a graph is
$\mathcal{L}(G) = \{t \in Tree \mid t \preccurlyeq G\}$.
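Whether a simulation exists can be decided by a single recursive pass over
the tree. The sketch below is an illustration under assumptions of its own:
a tree is a nested pair `(label, [subtrees])`, the graph is given by its set
of edges, and the function name `simulated` is hypothetical. A node is
accepted when every edge leaving its label is answered by some accepted
child, which is condition 2 of the definition; the relation containing
`(label, node)` for every accepted node then satisfies all three conditions.

```python
def simulated(edges, root, tree):
    """Decide whether the rooted graph (edges, root) can be simulated in tree.

    edges: set of pairs (a, a2) of labels; tree: (label, [subtrees]).
    """
    def ok(node):
        label, kids = node
        needed = {b for a, b in edges if a == label}
        # Each outgoing edge (label, b) needs a child labeled b that is ok.
        return all(any(k[0] == b and ok(k) for k in kids) for b in needed)

    # Condition 1 pairs the graph root with the tree root, and
    # condition 3 forces their labels to agree.
    return tree[0] == root and ok(tree)
```

With the edges `{("r", "a"), ("r", "b"), ("b", "d")}` and root `"r"`, the
tree `("r", [("a", []), ("b", [("d", [])]), ("c", [])])` is accepted (the
extra child `c` is harmless), while `("r", [("a", []), ("b", [])])` is
rejected because the node labeled `b` has no child labeled `d`.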

A rooted graph $G_1 = (\Sigma, root, E_1)$ is a subgraph of another
dependency graph $G_2 = (\Sigma, root, E_2)$ if $E_1 \subseteq E_2$.
For a dependency graph $G = (\Sigma, root, E)$, we define the partial
order $\le_G$ on the subgraphs of $G$: given $G_1$ and $G_2$ two
subgraphs of $G$, $G_1 \le_G G_2$ if $G_1$ is a subgraph of $G_2$.
Note that the relation $\le_G$ is reflexive, antisymmetric, and
transitive, thus being an order relation. Moreover, it is
well-founded and it has a minimal element, that we denote $G_0$ for
a graph $G$. Let $G_0 = (\Sigma, root, \emptyset)$ and indeed, for any
$G'$ subgraph of $G$ we have $G_0 \le_G G'$. In the sequel, we assume
w.l.o.g. that all the subgraphs that we use in our proofs have the
property that every edge can be part of a path starting at the root.

Lemma A.3 For any disjunction-free multiplicity schema
$S$, its universal dependency graph can be simulated in any
tree $t$ which belongs to the language of $S$:

$$\forall S \in MS.\ \forall t \in \mathcal{L}(S).\ t \preccurlyeq G_S^u$$

Proof. Consider an MS $S$ and its universal dependency
graph $G_S^u$. Let $t$ be a tree which belongs to $\mathcal{L}(S)$.
We want to construct a witness relation
$R \subseteq \Sigma \times N_t$ for $t \preccurlyeq G_S^u$ and
the proof goes by induction on the structure of $G_S^u$, using the
well-founded order $\le_{G_S^u}$, as defined above. Let $P(G)$ denote
the statement $t \preccurlyeq G$. Let $G$ be a subgraph of $G_S^u$.
The induction hypothesis is that for all $G' \le_{G_S^u} G$ with
$G' \ne G$, there exists a relation $R'$ witness of the simulation
$t \preccurlyeq G'$ and we are going to construct $R$ that witnesses
$t \preccurlyeq G$.

For the base case, we take the minimal element for the relation
$\le_{G_S^u}$, let it be $G_0 = (\Sigma, root, \emptyset)$; then
$P(G_0)$ holds for the relation $R_0 = \{(root, root_t)\}$, so the
subgraph containing no edge can be simulated in $t$.

For the induction case, let $G$ be a subgraph of $G_S^u$. By the
induction hypothesis, we know that $P(G')$ holds for every
$G' \le_{G_S^u} G$ with $G' \ne G$. Consider a subgraph $G'$ of $G$
such that $G$ contains exactly one additional edge w.r.t. $G'$; let
the additional edge be $(a, a')$ and $R'$ the witness relation for
$t \preccurlyeq G'$. Because $G' \le_{G_S^u} G$ and $(a, a')$ is the
only additional edge, we know that $R'$ already contains images for
$a$ in $t$ i.e., there exists a node $n$ such that $(a, n) \in R'$.
We construct the relation $R$ as the union of $R'$ with
$\{(a', n') \mid lab_t(n') = a' \wedge (\exists n.\ (n, n') \in child_t \wedge (a, n) \in R')\}$.
The set of tuples that we add is not empty because the edge
$(a, a')$ belongs to the universal dependency graph $G_S^u$, so for
any node labeled by $a$ in the tree $t$ there exists a child of it
labeled with $a'$. The construction ensures that $R$ satisfies all
the conditions of the definition of a simulation, so
$t \preccurlyeq G$, so $P(G)$ is true.

We have proved that $P(G_0)$ is true and that
$(\forall G'.\ G' \le_{G_S^u} G \wedge G' \ne G \Rightarrow P(G')) \Rightarrow P(G)$,
so $P(G)$ is true for any $G$ subgraph of $G_S^u$, so also for
$G_S^u$, hence $G_S^u$ can be simulated into any tree $t$ which
belongs to the language of $S$. □
Graph unfolding

A path in a graph (either dependency graph or universal
dependency graph) $G = (\Sigma, root, E)$ is a non-empty sequence
of vertices starting at $root$ such that for any two consecutive
vertices in the sequence, there is a directed edge between them in
$G$. By $Paths(G) \subseteq \Sigma^{+}$ we denote the set
of all the paths in $G$. The set of paths is finite only for
graphs without cycles reachable from the root. For instance,
the paths of the graph $G_1$ in Figure 5(b) are
$Paths(G_1) = \{r, ra, rb, rc, rbd, rcd, rbde, rcde\}$.

Similarly, a path in a tree $t$ is a non-empty sequence of
nodes starting at $root_t$ such that any two consecutive nodes
in the sequence are in the relation $child_t$. By $Paths(t)$
we denote the set of all the paths in $t$. Then, we define
$LabPaths(t)$ as the set of sequences of labels of nodes from all
the paths in $t$. For instance, for the tree $t_1$ from Figure 5(a)
we have
$Paths(t_1) = \{n_0, n_0 n_1, n_0 n_1 n_2, n_0 n_3, n_0 n_3 n_4\}$ and
$LabPaths(t_1) = \{r, ra, rab\}$. Note that
$Paths(t) \subseteq N_t^{+}$, $LabPaths(t) \subseteq \Sigma^{+}$ and
$|LabPaths(t)| \le |Paths(t)|$.



The unfolding of a rooted graph $G = (\Sigma, root, E)$, denoted
$u_G$, is a tree
$u_G = (N_{u_G}, root_{u_G}, lab_{u_G}, child_{u_G})$, such that:

• $N_{u_G} = Paths(G)$,

• $root_{u_G} \in N_{u_G}$ is the root of $u_G$,

• $child_{u_G}(p, p \cdot a)$ for all paths
$p, p \cdot a \in Paths(G)$ (note that "$\cdot$" stands for
concatenation),

• $lab_{u_G}(root_{u_G}) = root$, and $lab_{u_G}(p \cdot a) = a$,
for all the paths $p \cdot a \in Paths(G)$.

The unfolding of a graph is finite only when the graph has no
cycle reachable from the root, because otherwise $Paths(G)$ is
infinite, so $u_G$ is infinite. In the sequel we use the unfolding
for graphs without any cycle reachable from the root and
in this case the unfolding is the smallest tree $u_G$ (w.r.t. the
number of nodes) having $LabPaths(u_G) = Paths(G)$. The
idea of the unfolding is to transform the rooted graph $G$ into
a tree having the child relation instead of directed edges.
Some nodes are duplicated in order to avoid nodes with
more than one incoming edge. For instance, in Figure 5(b)
we take the graph $G_1$ and construct its unfolding $u_{G_1}$. We
remark that the size of the unfolding may be exponential
in the size of the graph, for example for the graph $G_2$ from
Figure 5(c).
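The unfolding can be computed by enumerating the paths of the graph. The
sketch below is an illustration with hypothetical names (`paths`,
`unfolding`, `diamonds`); it assumes that the graph is given as a list of
edges and has no cycle reachable from the root, since otherwise the
enumeration does not terminate. The edge list used for $G_1$ is inferred
from the set $Paths(G_1)$ listed above, and `diamonds` is a stand-in chain
of diamond-shaped layers chosen to show the exponential growth, not
necessarily the graph $G_2$ of Figure 5(c).

```python
def paths(edges, root):
    """All paths from the root of an acyclic rooted graph, as label tuples."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    found, stack = [], [(root,)]
    while stack:
        p = stack.pop()
        found.append(p)  # every path is one node of the unfolding
        stack.extend(p + (b,) for b in succ.get(p[-1], []))
    return found


def unfolding(edges, root):
    """Child relation of the unfolding: each path hangs below its prefix."""
    return {(p[:-1], p) for p in paths(edges, root) if len(p) > 1}


def diamonds(k):
    """A chain of k diamonds: v_i -> x_i, y_i -> v_(i+1)."""
    edges = []
    for i in range(k):
        for mid in (f"x{i}", f"y{i}"):
            edges += [(f"v{i}", mid), (mid, f"v{i + 1}")]
    return edges


G1 = [("r", "a"), ("r", "b"), ("r", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
```

On `G1` the function `paths` returns the eight paths of $Paths(G_1)$. In
`diamonds(k)` the number of paths reaching the last vertex doubles with
every layer, so a graph with $3k + 1$ vertices has $2^k$ copies of that
vertex in its unfolding.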



Extending the definition of embedding

If a query $q$ can be embedded in a tree $t$, we may write
$t \preccurlyeq q$ instead of $t \models q$. We also extend the
definition of embedding from a query to a tree to the embedding from
a tree to another tree i.e., given two trees $t$ and $t'$, we say
that $t'$ can be embedded in $t$ (denoted $t \preccurlyeq t'$) if the
query $(N_{t'}, root_{t'}, lab_{t'}, child_{t'}, \emptyset)$ can be
embedded in $t$. Similarly, we can define the embedding from a tree
to a rooted graph. Note that two embeddings can be composed, for
example:

• $\forall t, t' \in Tree.\ \forall q \in Twig.\ (t \preccurlyeq t' \wedge t' \preccurlyeq q \Rightarrow t \preccurlyeq q)$.

• $\forall S \in MS.\ \forall t \in Tree.\ \forall q \in Twig.\ (G_S^u \preccurlyeq t \wedge t \preccurlyeq q \Rightarrow G_S^u \preccurlyeq q)$.

Figure 5: A tree and two graphs with their corresponding
unfoldings. Panel (c): graph $G_2$ and its exponential unfolding.

Lemma A.4 A rooted graph (dependency graph or universal
dependency graph) $G = (\Sigma, root, E)$ can be simulated in
a tree $t$ iff its unfolding $u_G$ can be embedded in $t$.

Proof. For the if part, we know that $t \preccurlyeq u_G$ so there
exists a function $\lambda : N_{u_G} \to N_t$ which witnesses the
embedding of $u_G$ in $t$. We construct a relation
$R \subseteq \Sigma \times N_t$ such that:

$$R = \{(root, root_t)\} \cup \{(a, n) \mid \exists p \in N_{u_G}.\ p \cdot a \in N_{u_G} \wedge \lambda(p \cdot a) = n\}$$

This construction ensures that for every $(a, n) \in R$ and for
every $(a, a') \in E$, there exists $n' \in N_t$ such that
$(n, n') \in child_t$ and $(a', n') \in R$ because the function
$\lambda$ is a witness for $t \preccurlyeq u_G$ so the child relation
is simply translated from $u_G$ to $G$. The construction of $R$ also
guarantees that for every $(a, n) \in R$ we have $lab_t(n) = a$
because $\lambda$ is the witness for $t \preccurlyeq u_G$ and
$\lambda(p \cdot a) = n$. Thus we obtain that $R$ satisfies all
the conditions to be a simulation of $G$ in $t$.

For the only if case, we take a relation $R$ which witnesses
the simulation of $G$ in $t$. We construct the function
$\lambda : N_{u_G} \to N_t$, witness of $t \preccurlyeq u_G$, by
recursion on the paths of $G$, because $Paths(G) = N_{u_G}$. First
of all, $\lambda(root_{u_G}) = root_t$.
We assume that we have a recursive procedure which takes
as input a path $p$, a label $a$, and the values of the function
$\lambda$ computed before the procedure call, and it outputs
$\lambda(p \cdot a)$. The invariant of the procedure is that while
defining $\lambda$ for $p \cdot a$, $\lambda$ satisfies the
conditions from the definition of embedding for all the nodes
$root_{u_G}, \dots, p$ on the path to $p$. Furthermore,
the values of $\lambda$ were obtained using the information given
by $R$, so $\lambda(p) = n'$ iff $R(lab_t(n'), n')$. Let
$\lambda(p) = n'$ and we construct $\lambda(p \cdot a) = n$, where
$R(a, n)$ and $child_t(n', n)$. There exists such a node $n$ because
of the recursive construction of $\lambda$ using $R$ and the
invariant $\lambda(p \cdot a) = n$ iff $R(a, n)$ is true. The
construction of $\lambda$ ensures that $\lambda$ is root-preserving,
child-preserving and label-preserving, so it satisfies all the
conditions to be an embedding from $u_G$ to $t$, so we have
found a correct witness for $t \preccurlyeq u_G$. □



Lemma A.5 A query $q$ can be embedded in a rooted graph
(dependency graph or universal dependency graph) $G$ iff $q$
can be embedded in the unfolding tree of $G$.

Proof. For the if part, we know that $u_G \preccurlyeq q$, so there
exists a function $\lambda : N_q \to N_{u_G}$ witness of this
embedding. We construct a function $\lambda' : N_q \to \Sigma$, such
that $\lambda'(n) = lab_{u_G}(\lambda(n))$ for each node $n$ from
$N_q$. Since $\lambda$ is the witness of the embedding
$u_G \preccurlyeq q$, the constructed $\lambda'$ satisfies all the
conditions of the definition of an embedding from $q$ to $G$.

For the only if part, we know that $G \preccurlyeq q$, so there
exists a function $\lambda : N_q \to \Sigma$ witness of this
embedding. We want to construct a function
$\lambda' : N_q \to N_{u_G}$ to prove $u_G \preccurlyeq q$. We
construct $\lambda'$ by recursion on the tree structure of $q$. First
of all, $\lambda'(root_q) = root_{u_G}$. Then, the recursion
hypothesis says that for any connected subtree $q'$ obtained from
$q$ by deleting some edges, $G \preccurlyeq q'$ implies
$u_G \preccurlyeq q'$, which is witnessed by the function
$\lambda'$. Thus, for any node $n$ of $q'$, $\lambda'(n) = p$, where
$p \in \Sigma^{+}$ because $N_{u_G} = Paths(G)$ so any node in the
unfolding can be identified by a unique sequence of labels among the
paths of $G$. For the inductive case consider that $q$ is obtained
from $q'$ by adding one more edge, let it be $(n, n')$. If it is a
child edge and $\lambda'(n) = p$, we construct
$\lambda'(n') = p \cdot \lambda(n')$, which is a path in $G$ by the
definition of the unfolding. Otherwise, if it is a descendant edge
and $\lambda'(n) = p$, we construct
$\lambda'(n') = p \cdot p' \cdot \lambda(n')$, where $p'$ is an
arbitrarily chosen path in $G$ from $\lambda(n)$ to $\lambda(n')$.
We know by definition of $\lambda$ that such a path exists. The
construction ensures that $u_G \preccurlyeq q$, for any $q$
satisfying the conditions of the recursion, so we can construct
a function $\lambda'$ which is a correct witness for
$u_G \preccurlyeq q$. □

Fuse and add operations

In Figure 6 we present the operations fuse and add. We
say that $t \lhd_0 t'$ if $t'$ is obtained from $t$ by applying one
of the operations from Figure 6. The fuse operation takes two
siblings with the same label and creates only one node having
below the subtrees corresponding to each of the siblings.
The add operation consists simply in adding a subtree at
any place in the tree. By $\unlhd$ we denote the transitive and
reflexive closure of $\lhd_0$.

Note that the fuse and add operations preserve the embedding
i.e., given a twig query $q$ and two trees $t$ and $t'$, if
$t \preccurlyeq q$ and $t \unlhd t'$ then $t' \preccurlyeq q$.
Furthermore, if we can embed a query $q$ in a tree $t$ which can be
embedded in the dependency graph of an MS $S$, we can perform a
sequence of operations such that $t$ is transformed into another
tree $t'$ satisfying $S$ and $q$ at the same time. Formally,

Proposition A.6 Given an MS $S$, a query $q$ and a tree $t$,
if $G_S \preccurlyeq t$ and $t \preccurlyeq q$, then there exists a
tree $t' \in \mathcal{L}(S) \cap \mathcal{L}(q)$.
The tree $t'$ can be constructed after a sequence of fuse and
add operations (consistently with the schema $S$) from the
tree $t$ and we denote $t \unlhd_S t'$.

Figure 6: Operations fuse and add.

Family of characteristic graphs

Given a query $q$ and a schema $S$, if $q$ can be embedded in $G_S$
then we can capture all the trees satisfying $S$ and $q$ at the
same time with a potentially infinite family of graphs. First,
we explain the construction of the characteristic graphs. A
characteristic graph $G$ for a schema $S$ and a query $q$ is a
tuple $(V_G, root_G, lab_G, E_G)$, where $V_G$ is a finite set of
vertices, $root_G \in V_G$ is the root of the graph,
$lab_G : V_G \to \Sigma$ is a labeling function (with
$lab_G(root_G) = root_S$), and $E_G \subseteq V_G \times V_G$
represents the set of edges. Note that for two
$x, y \in \Sigma \cup \{*\}$ we say that $x$ matches $y$ if
$y \ne *$ implies $x = y$. We construct $G$ with the three steps
described below:

1. For any $(n_1, n_2) \in child_q$, add $n'_1, n'_2$ to $V_G$ and
$(n'_1, n'_2)$ to $E_G$, where $lab_G(n'_1)$ matches $lab_q(n_1)$
and $lab_G(n'_2)$ matches $lab_q(n_2)$.

2. For any $(n_1, n_2) \in desc_q$, choose an acyclic path
$n'_1, \dots, n'_k$ from $G_S$, such that $n'_1$ matches
$lab_q(n_1)$ and $n'_k$ matches $lab_q(n_2)$. We add to $G$ the
corresponding vertices and edges for this path, as shown for the
previous case.

3. For any $n \in V_G$, take the subgraph from $G_S^u$ starting at
$lab_G(n)$ and fuse it in the node $n$ in the graph $G$.

In Figure 7(b) we present an example of a graph obtained
from the embedding from Figure 7(a). We denote by
$\mathcal{G}(q, S)$ the set of all the graphs obtained from a query
$q$ and a multiplicity schema $S$ using the three steps above, using
all the embeddings from $q$ into $S$. We extend the previous
definition of the unfolding to the characteristic graphs. Since
a graph $G \in \mathcal{G}(q, S)$ is acyclic, it has a finite
unfolding. From the definition it also follows that the size of $G$
is polynomially bounded by $|q| \times |S|$ and $G \preccurlyeq q$.

If we allow cyclic paths in step 2, then we obtain similarly
the set $\mathcal{G}^{*}(q, S)$. Note that $|\mathcal{G}(q, S)|$ is
finite and may be exponential, while $|\mathcal{G}^{*}(q, S)|$ may be
infinite. All the trees
$t \in \mathcal{L}(S) \cap \mathcal{L}(q)$ can be obtained by fuse
and add operations (consistently with $S$) from the unfolding trees
of the graphs in $\mathcal{G}^{*}(q, S)$:

$$\forall t \in \mathcal{L}(S) \cap \mathcal{L}(q).\ \exists G \in \mathcal{G}^{*}(q, S).\ u_G \unlhd_S t$$

Furthermore, by using a pumping argument, we have, for any other
twig query $q'$:

$$\forall q' \in Twig.\ \forall G \in \mathcal{G}^{*}(q, S).\ (G \not\preccurlyeq q' \Rightarrow \exists G' \in \mathcal{G}(q, S).\ G' \not\preccurlyeq q').$$

Figure 7: An embedding from a query $q$ to a dependency graph
$G_S$ and a graph $G \in \mathcal{G}(q, S)$. In $G_S$, the
non-nullable edges are drawn with a full line and the
nullable edges with a dotted line. Panel (b): graph
$G \in \mathcal{G}(q, S)$.

Proof of Lemma 4.7

(1) Given a query $q$ and an MS $S$, $q$ is satisfiable by $S$ iff
$G_S \preccurlyeq q$.

For the if part, we know that $G_S \preccurlyeq q$, so the family of
graphs $\mathcal{G}(q, S)$ is not empty. The unfolding of any graph
from $\mathcal{G}(q, S)$ satisfies $S$ and $q$ at the same time,
hence $q$ is satisfiable by $S$.

For the only if part, we know that there exists a tree
$t \in \mathcal{L}(S) \cap \mathcal{L}(q)$, which can for example be
obtained after fuse operations (since one occurrence is consistent
with all the multiplicities except 0) on the unfolding of a graph
$G$ from $\mathcal{G}^{*}(q, S)$. Since $t \preccurlyeq q$, we
obtain $u_G \preccurlyeq q$, so $G \preccurlyeq q$, which,
from the construction of $G$, implies that $G_S \preccurlyeq q$.

(2) Given a query $q$ and an MS $S$, $q$ is implied by $S$ iff
$G_S^u \preccurlyeq q$.

For the if part, we know that $G_S^u \preccurlyeq q$, which implies
by Lemma A.5 that $u_{G_S^u} \preccurlyeq q$. On the other hand,
take a tree $t \in \mathcal{L}(S)$. By Lemma A.3 we have
$t \preccurlyeq G_S^u$, which implies by Lemma A.4 that
$t \preccurlyeq u_{G_S^u}$. From the last embedding and
$u_{G_S^u} \preccurlyeq q$ we infer that $t \preccurlyeq q$. Since
$t$ can be any tree in the language of $S$, we conclude that $q$ is
implied by $S$.

For the only if part, we know that for any
$t \in \mathcal{L}(S)$, $t \preccurlyeq q$.
Naturally, $u_{G_S^u}$ is in the language of $S$ (since one
occurrence is consistent with all the multiplicities except 0), so
$u_{G_S^u} \preccurlyeq q$. From the definition of the unfolding, we
can infer that $G_S^u \preccurlyeq u_{G_S^u}$, which implies that
$G_S^u \preccurlyeq q$. □



Proof of Theorem[ 

The embedding from a query to a graph can be tested in
polynomial time with a simple bottom-up algorithm. From this
observation and Lemma 4.7 we conclude that $SAT_{MS,Twig}$
and $IMPL_{MS,Twig}$ are in PTIME. □
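One possible shape of such a bottom-up test is sketched below. It is an
illustration, not the authors' algorithm, and it assumes the usual reading
of an embedding into a graph: the query root is mapped to the graph root, a
label other than the wildcard `*` must be preserved, a child edge is mapped
to an edge of the graph, and a descendant edge to a path of at least one
edge. The query encoding `(label, [(axis, subquery), ...])` and the name
`embeds` are hypothetical.

```python
def embeds(query, edges, root):
    """Decide bottom-up whether a twig query embeds in a rooted graph.

    query: (label, [(axis, subquery), ...]) with axis "child" or "desc";
    edges: iterable of pairs of labels; root: label of the graph root.
    """
    vertices = {root} | {v for e in edges for v in e}
    succ = {v: {b for a, b in edges if a == v} for v in vertices}
    # reach[v]: vertices reachable from v by a path of at least one edge.
    reach = {v: set(succ[v]) for v in vertices}
    changed = True
    while changed:
        changed = False
        for v in vertices:
            new = set().union(*(reach[u] for u in reach[v])) - reach[v]
            if new:
                reach[v] |= new
                changed = True

    def images(node):
        """Vertices at which the subquery rooted at node can be embedded."""
        label, branches = node
        cands = set(vertices) if label == "*" else {label} & vertices
        for axis, sub in branches:
            targets = images(sub)
            step = succ if axis == "child" else reach
            cands = {v for v in cands if step[v] & targets}
        return cands

    return root in images(query)
```

Branches can be handled independently because an embedding into a graph is
not required to be injective. On the graph with edges r→a, r→b, b→d, d→e
and root `r`, the query `r[a][//e]`, written
`("r", [("child", ("a", [])), ("desc", ("e", []))])`, embeds, while `r[e]`
does not.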

Proof of Theorem 4.9

Theorem 4 from [15] implies that $CNT_{MS,Twig}$ is coNP-hard.
Proving the membership of the problem to coNP is, however,
not trivial. Given an instance $(p, q, S)$, a witness is a
function $\lambda : N_p \to \Sigma$. Testing whether $\lambda$ is an
embedding from $p$ to $G_S$ requires polynomial time. If $\lambda$
is an embedding, a non-deterministic polynomial algorithm chooses a
graph $G$ from $\mathcal{G}(p, S)$ and checks whether $q$ can be
embedded in $G$. We claim that:

$$p \not\subseteq_S q \iff \exists G \in \mathcal{G}(p, S).\ G \not\preccurlyeq q$$

For the if case, we assume that there exists a graph
$G \in \mathcal{G}(p, S)$ such that $G \not\preccurlyeq q$. We know
that $G \preccurlyeq p$, so $u_G \preccurlyeq p$, so there exists a
tree $t \in \mathcal{L}(S)$ such that $t \preccurlyeq p$ and
$u_G \unlhd_S t$ (using only fusions since one occurrence is
consistent with all the multiplicities except 0). If we assume, for
the sake of contradiction, that $t \preccurlyeq q$, we have
$u_G \preccurlyeq q$, so $G \preccurlyeq q$, which is a
contradiction. We infer thus that there exists a tree
$t \in \mathcal{L}(S) \cap \mathcal{L}(p)$, such that
$t \notin \mathcal{L}(q)$, so $p \not\subseteq_S q$.

For the only if case, we assume that $p \not\subseteq_S q$, so there
exists a tree $t \in \mathcal{L}(S) \cap \mathcal{L}(p)$ such that
$t \notin \mathcal{L}(q)$. Because
$t \in \mathcal{L}(S) \cap \mathcal{L}(p)$, we know that there
exists a graph $G \in \mathcal{G}^{*}(p, S)$, such that
$u_G \unlhd_S t$. We know that $t \not\preccurlyeq q$, so
$u_G \not\preccurlyeq q$, so $G \not\preccurlyeq q$.
Moreover, we know using the pumping argument that in this
case there exists a graph $G' \in \mathcal{G}(p, S)$ such that
$G' \not\preccurlyeq q$. □

Disjunction-free DTDs

Similarly to the DMS, we represent a disjunction-free DTD
as a tuple $S = (root_S, R_S)$, where $root_S$ is a designated root
label and $R_S$ maps symbols to regular expressions using no
disjunction i.e., regular expressions of the form:

$$E ::= \epsilon \mid a \mid E^{*} \mid E^{?} \mid E^{+} \mid E_1 \cdot E_2$$

where $a \in \Sigma$. Given such an expression $E$, consider the set
$non\_nullable(E)$ which contains the set of labels present in
all the words from $\mathcal{L}(E)$. Formally,

$$non\_nullable(E) = \{a \in \Sigma \mid \forall w \in \mathcal{L}(E).\ \exists w_1, w_2.\ w = w_1 \cdot a \cdot w_2\}$$

We can compute $non\_nullable(E)$ recursively:

$non\_nullable(\epsilon) = non\_nullable(E^{*}) = non\_nullable(E^{?}) = \emptyset$
$non\_nullable(a) = \{a\}$
$non\_nullable(E_1 \cdot E_2) = non\_nullable(E_1) \cup non\_nullable(E_2)$
$non\_nullable(E^{+}) = non\_nullable(E)$

Similarly, let $nullable(E)$ be the set containing labels which
appear in at least one word from $\mathcal{L}(E)$. Formally,

$$nullable(E) = \{a \in \Sigma \mid \exists w \in \mathcal{L}(E).\ \exists w_1, w_2.\ w = w_1 \cdot a \cdot w_2\}$$

We can compute $nullable(E)$ recursively:

$nullable(\epsilon) = \emptyset$
$nullable(a) = \{a\}$
$nullable(E^{+}) = nullable(E^{*}) = nullable(E^{?}) = nullable(E)$
$nullable(E_1 \cdot E_2) = nullable(E_1) \cup nullable(E_2)$

Next, we adapt the notions of dependency graph and universal
dependency graph for disjunction-free DTDs. The
dependency graph of a disjunction-free DTD $S$ is a rooted
graph $G_S = (\Sigma, root_S, E_S)$, where

$$E_S = \{(a, a') \mid a' \in nullable(R_S(a))\}.$$

Similarly, the universal dependency graph of a disjunction-free
DTD $S$ is a rooted graph $G_S^u = (\Sigma, root_S, E_S^u)$, where

$$E_S^u = \{(a, a') \mid a' \in non\_nullable(R_S(a))\}.$$

We assume w.l.o.g. that from now on we manipulate only
disjunction-free DTDs having no cycle in the universal
dependency graph. Otherwise, if there is a cycle in the universal
dependency graph, this means that there does not exist
any tree consistent with the schema and containing any of
the labels involved in that cycle.

For a symbol $a \in \Sigma$ and a disjunction-free regular
expression $E$, by $min\_nb(E, a)$ we denote the minimum number of
occurrences of the symbol $a$ in any word consistent with $E$.

$min\_nb(\epsilon, a) = min\_nb(E^{*}, a) = min\_nb(E^{?}, a) = 0$
$min\_nb(a, a) = 1$
$min\_nb(E_1 \cdot E_2, a) = min\_nb(E_1, a) + min\_nb(E_2, a)$
$min\_nb(E^{+}, a) = min\_nb(E, a)$
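The three recursive definitions for disjunction-free regular expressions
(`non_nullable`, `nullable`, and `min_nb`) can be sketched directly in
Python. The tuple encoding of expressions used here is an assumption made
for the example: `("eps",)`, `("sym", a)`, `("star", E)`, `("opt", E)`,
`("plus", E)`, and `("cat", E1, E2)`. The case of a single symbol different
from the one being counted is not listed in the text and is taken to be 0.

```python
def non_nullable(e):
    """Labels present in all the words of L(E)."""
    kind = e[0]
    if kind == "sym":
        return {e[1]}
    if kind == "cat":
        return non_nullable(e[1]) | non_nullable(e[2])
    if kind == "plus":
        return non_nullable(e[1])
    return set()  # eps, star, opt


def nullable(e):
    """Labels present in at least one word of L(E)."""
    kind = e[0]
    if kind == "eps":
        return set()
    if kind == "sym":
        return {e[1]}
    if kind == "cat":
        return nullable(e[1]) | nullable(e[2])
    return nullable(e[1])  # plus, star, opt


def min_nb(e, a):
    """Minimum number of occurrences of a in any word of L(E)."""
    kind = e[0]
    if kind == "sym":
        return 1 if e[1] == a else 0  # the 0 case is an added assumption
    if kind == "cat":
        return min_nb(e[1], a) + min_nb(e[2], a)
    if kind == "plus":
        return min_nb(e[1], a)
    return 0  # eps, star, opt


# The expression a . b* . (c . d?)+ in the assumed encoding.
EXAMPLE = ("cat", ("sym", "a"),
           ("cat", ("star", ("sym", "b")),
            ("plus", ("cat", ("sym", "c"), ("opt", ("sym", "d"))))))
```

For `EXAMPLE`, `non_nullable` yields `{"a", "c"}`, so a rule mapping a label
to this expression contributes two edges to the universal dependency graph,
while `nullable` yields all four labels and hence four edges of the
dependency graph.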

We adapt the definition of unfolding for the (universal) de- 
pendency graph of a disjunction-free DTD. For a disjunction- 
free multiplicity schema, the unfolding of the universal de- 
pendency graph belongs to its language since one occurrence 
is consistent with all the multiplicities except 0. On the 
other hand, for a disjunction-free DTD S this property does 
not hold, so we extend the construction of the unfolding with 
one more step: 

• Let $u_{G_S^u}$ be the unfolding of $G_S^u$ obtained as it is
defined for the MS.

• Update $u_{G_S^u}$ such that for any $n \in N_{u_{G_S^u}}$ and
for any $a \in \Sigma$, let $t_a$ be the subtree having as root the
child of $n$ labeled by $a$. Next, add copies of $t_a$ as children
of $n$ until $n$ has $min\_nb(R_S(lab_{u_{G_S^u}}(n)), a)$ children
labeled with $a$.

Note that a consequence of this new definition is that the 
unfolding of the universal dependency graph of a disjunction- 
free DTD belongs to its language (modulo the order of the 
elements). The order imposed by the DTD on the elements 
is not important because in the sequel we work with twig 
queries, which ignore this order. 

Proof of Corollary KM

We claim that a query $q$ is implied by a disjunction-free
DTD $S$ iff $G_S^u \preccurlyeq q$ and since the embedding of a
query in a graph can be computed in polynomial time, this implies
that $IMPL_{disj\text{-}free\ DTD,Twig}$ is in PTIME. The proof
follows immediately from the proof of Lemma 4.7(2), taking into
account the new definition of the unfolding. Theorem 4 from [15]
implies that $CNT_{disj\text{-}free\ DTD,Twig}$ is coNP-hard. The
membership of $CNT_{disj\text{-}free\ DTD,Twig}$ to coNP follows
from the proof of Theorem 4.9 while taking into account the new
definition of the unfolding. □



